Arcus Pro-Stat Help: |CONTENTS|
¬<Introduction>╪496 ¬
¬<Basics>╪17430 ¬
¬<Data Management>╪35334 ¬
¬<Database Manager>╪59749 ¬
¬<Analysis>╪77148 ¬
¬<Algebraic Calculator>╪288806 ¬
¬<Setup>╪9267 ¬
¬<Technical Information>╪3186 ¬
¬<Appendices>╪292864 ¬
¬<Reference List>╪310584 ¬
¬<Help>╪297385 ¬
This is the hypertext help system for Arcus Pro-Stat version 3. If you are not
sure how to use this system then please press F1 now.
|Introduction|
Arcus is a general statistical analysis package which has been developed for
use in biomedical research. It has also found popularity in education and many
branches of commerce. The Arcus project was started because the aims listed
below were not met by any other software package for the PC. Arcus has now
developed a style of its own and a worldwide reputation for making statistical
analysis more approachable. As we develop the Arcus project the following aims
continue to direct our work.
1. A collection of the most commonly used statistical procedures built on
robust modern methodology to achieve accuracy and to avoid the
compromise of approximation wherever possible.
2. A user friendly approach which is intuitive and which requires little
reference to printed literature.
3. A detailed coverage of the statistical procedures which are done badly
or not at all by other statistical packages.
4. A toolbox of basic statistical procedures which are useful in research
but are seldom found in easily accessible forms in other statistical
packages.
5. A project for which the primary objective is not financial but is a
dedication to the excellence of the product. This project is to be
supported indefinitely.
Since the conception of the Arcus project in 1988 there has been a commitment
to provide facilities which the users request and, most importantly, to present
these facilities in a way which is user friendly. These objectives are often
difficult to apply to statistical analysis but after much consultation with
Arcus users it has been possible to develop interfaces which are intuitively
simple to use. As a registered user you are now entitled to submit suggestions
for the development of the Arcus project. If you are a member of an
organisation which has a site licence for Arcus then please make your
suggestions through one representative. If you have any problems with this
software or suggestions of new features for future versions then you are most
welcome to write to us. Please make clear reference to published literature
in all correspondence concerning statistical calculation or inference.
Newsletters keep the Arcus user informed of developments in the project and
you are invited to submit articles concerning any aspect of statistical
analysis, computing or your application of Arcus.
All correspondence should be sent to:
Dr Iain E. Buchan,
Medical Computing,
83, Turnpike Road,
Aughton,
West Lancashire, L39 3LD.
UNITED KINGDOM
Tel (0)695 424 034
Fax (0)51 256 7001
|Technical Information|
Arcus requires at least 448k of free memory (i.e. 640k + disk based DOS) in a
286 or better system running or emulating MS DOS version 3.30 or later. MS DOS
version 5 and above enhances Arcus Pro-Stat by providing more memory and
executing the code faster than previous versions of DOS. If you have extended
memory configured as expanded memory using a driver such as EMM.EXE or
EMM386.EXE then Arcus Pro-Stat will use this to improve the overall efficiency
of the package. Further enhancements in operation speed are afforded by using
a disk cache system such as SMARTDRV.SYS supplied with MS DOS.
The number of data points which Arcus can hold at any one time is determined
by the amount of memory which your computer has free. This is reflected in the
storage capacity of the
worksheet. When you start an Arcus session the number of cells which the
worksheet can contain is a function of the amount of addressable free memory
divided between 50 columns. You can reset the column limit and then the
maximum number of rows is determined by free memory. The total data storage
capacity is greatest on a well configured 486 or Pentium with expanded memory.
Arcus Pro-Stat will run faster in the presence of a mathematical co-processor
because the burden of floating point maths is taken away from the program code
which emulates a co-processor in the absence of one. Some calculations and
sorting/ranking procedures will run up to five times faster. 486 DX and
Pentium systems have floating point co-processors as standard.
Please note that Arcus now requires at least a 286 processor. It will not run
on old 8086, 8088, V20 or V30 systems.
Microsoft's mouse driver (MOUSE.COM) is supplied on installation disk one; this
should be tried if you experience problems with your existing mouse driver
software.
Arcus graphics screen modes are selected by an internal system analysis routine
(Autoselect) but this may be overridden by an option in the ¬setup╪9267 ¬ menu. Due
to the wide diversity of video cards available Arcus can not be guaranteed to
display every screen perfectly but it has been tested with CGA, EGA, VGA, MCGA
and Hercules. If you have any problems with Arcus graphics then try using
different user defined screen selections.
In order to display Arcus graphics with a Hercules monochrome graphics adapter
you will need to have loaded the MSHERC.COM program before starting the main
Arcus program. Install handles this for you by inserting the line MSHERC.COM
into the ARCUS.BAT file which loads Hercules support routines when a Hercules
monochrome adapter is detected.
The graphics provided in Arcus can be used for presentation if you have a
PostScript printer. The other printer options, Hewlett Packard Laserjet and
Epson FX, are simple screen dumps which are intended for instant visual
analysis only. If you do not have a PostScript compatible printer then you can
save Arcus PostScript graphics files to disk and have them printed out on a
PostScript system at a later date.
Most results screens, including the pictorial statistics selections which are
marked with a hash (#) in the menu, use only standard ASCII characters so that you
can obtain a hard copy using any line printer. This is achieved by pressing P
or E when results are displayed. Please do not use the print screen key. Once
you have pressed P or E you enter the Arcus screen editor; the screen will turn
to inverse video (black on white) and you have an opportunity to annotate the
results before they are sent to the printer or to a log file on disk (please
refer to ¬Basics╪17430 ¬). The printing routines are designed to keep a paper record
of the work done in your Arcus work sessions and they operate most efficiently
with continuous or sheet-fed stationery. For uninterrupted output please be
sure to set the lines per page option in the setup menu; this defines the
number of lines which your printer fits on one page.
If you experience a problem of Arcus Pro-Stat "hanging up" (i.e. no response
from the keyboard) then please make sure that you have avoided the following
situations. Firstly you must not use Arcus Pro-Stat on a computer which runs
anything less than a 286 processor. Secondly you must remove unnecessary TSR
(terminate and stay resident) programs before running Arcus Pro-Stat. Very few
TSRs cause problems but I have come across some rogue public domain and early
freebie system utilities which conflict with code that conforms strictly
to Microsoft standards. Examples of these rogues are KEYBUK.EXE and SPEED.SYS.
Please use the MS DOS KEYB.COM routine in place of KEYBUK.EXE and do not use
SPEED.SYS. Please do not use any non-standard DOS components, especially
replacements for COMMAND.COM. DOS components which can cause strange
looking screens are ANSI.SYS and the MODE.COM PAGE settings. Try removing
these from the CONFIG.SYS file; they are not used by good software and they
take up memory.
If you are a Microsoft Windows user then please note that you can use the
clipboard to paste results screens to other applications if you have installed
Arcus Pro-Stat as a DOS application in Windows running in enhanced mode. Please
remember that Arcus must be started via the ARCUS.BAT batch file, therefore you
must specify ARCUS.BAT as the command line when installing Arcus as a DOS
application in the Windows environment. DO NOT LET WINDOWS INSTALL ARCUS
AS A DOS APPLICATION WITH THE COMMAND LINE ARCUS_.EXE, IT MUST BE ARCUS.BAT
INSTEAD! Arcus Pro-Stat takes advantage of some memory management features in
Windows even though it is run as a DOS application.
Arcus Pro-Stat has been developed using Microsoft FORTRAN version 5.1, Microsoft
BASIC Professional Development System version 7.1 and Microsoft Macro Assembler
with all compiled code linked by Blinker version 3.0. All executable code
conforms to LIM (Lotus Intel Microsoft) standards and will take advantage of
LIM 4 expanded memory if present.
|Setup|
¬<Data File Path>╪10092 ¬
¬<Printer Port>╪10623 ¬
¬<Lines per Page>╪10854 ¬
¬<Graphics Printer>╪11630 ¬
¬<Graphics Screen>╪12540 ¬
¬<Mouse Sensitivity>╪12882 ¬
¬<Screen Colours>╪13127 ¬
Some information about your computer hardware and preferences is kept in memory
for Arcus to refer to. This information is stored in a setup file called
ARCUS.SET which you will find in the Arcus program directory. Do not attempt
to alter this file externally. All setup information is configured via the
setup menu. When you are happy with the information you have specified then
you can update the ARCUS.SET file by selecting "save new settings". If the
ARCUS.SET file is accidentally lost then you are forced through this setup
procedure when you begin an Arcus session.
|Data File Path|
This is the disk location where Arcus worksheet files are to be stored. If you
followed the default installation procedure on hard disk drive C then this
location will be C:\ARCUS\DATA. Using the \DATA sub-directory off the \ARCUS
directory is logical; you are advised to keep your hard disk structure as simple
as possible. There are, however, circumstances such as network use when you
would rather use a removable disk for data storage. If this is the case then
simply enter the drive path A:\.
|Printer Port|
This refers to the parallel printer port that you want to use for Arcus
print-outs. Most computers have at least one of these ports, designated LPT1,
LPT2 etc. You can not select a serial port (COM1 etc).
|Lines per Page|
This tells Arcus how many lines of text your printer fits on one page. It will
vary with font, line spacing and paper size. Choose the lines per page figure
which is appropriate to your printer when first switched on. If you do not set
this information properly then Arcus will put page breaks in the wrong place.
This will cause printing over perforations or odd looking sheet fed print-outs
with large gaps.
If you are a Laserjet user then you can select the number of lines per page on
the printer as well as in Arcus setup. You are advised to select a small font
so that you save paper.
If you are a PostScript user then you can ignore this option; it is set automatically
for you when you select PostScript as the graphics printer type.
|Graphics Printer|
Arcus treats printed graphics in one of two ways. The first is a simple screen
dump for instant visual analysis only and the second is high quality output for
presentation. The only target for presentation quality graphics is PostScript.
PostScript was chosen for Arcus as it is a portable and versatile language.
PostScript output from Arcus can be sent directly to a printer or to an encapsulated
PostScript file (EPS) on disk. You can use this EPS file as a graphic figure
in most word processing documents intended for a PostScript printer.
Simple screen dumps are provided for Hewlett Packard Laserjet and Epson FX
compatible printers. You can select the resolution and orientation of the
output. Please remember that these screen dumps are not intended for
presentation; if you need presentation output then please consider a PostScript
cartridge for your Laserjet.
|Graphics Screen|
Arcus can detect the best setting for most graphics adapters when you have set
this option to "Autoselect". There are, however, exceptions so you are given
the option of forcing Arcus to use a particular graphics mode. You can not use
a video mode if it is not supported by your video card (see hardware manual).
|Mouse Sensitivity|
This sets the amount of mouse movement needed to shift the cursor. Thus a low
setting requires less movement of the mouse to move the cursor, i.e. it is more
sensitive. Settings are 1 to 100; most rodents prefer around 20.
|Screen Colours|
This section enables you to select colours for various categories of text. Some
Arcus screen colours can not be changed. A black background has been chosen
quite deliberately; this is to minimise the ambient radiation. Radiation from
monitors may not prove to be a significant problem but why take the chance?
|Save New Settings|
This saves the settings in the rest of this section to a file called ARCUS.SET.
Unless you save your settings in this file they will not take effect next time
you start Arcus.
|Return to Previous Menu|
This is the "step back" button in the Arcus menu system. It is also achieved
by pressing the Esc key or the right mouse button.
|Windows|
If you are a Microsoft Windows user then you should consider running Arcus as
a DOS application from within Windows. When running Arcus within Windows in
386 enhanced mode you can paste Arcus results screens into the Windows
clipboard for subsequent use in Windows applications. This is done by pressing
Alt + Enter when Arcus is running; at this point you have a window of Arcus
within the Windows environment. You might find the best results with the font
set at 10 x 16. From the pull down menu of this window you select edit and copy
to grab marked text or graphics from the Arcus window. This is then available
in the clipboard for pasting into Windows applications. You can run Arcus
Pro-Stat from a window within Windows but this is not advisable as it slows down
all screen writing processes. You can not initiate Arcus graphics when running
Arcus in a window within Windows; this requires full screen operation. See also
"¬Technical Information╪3186 ¬".
|DOS Shell|
This option provides access to all of your other programs without losing any
of the Arcus information you are working with. The memory overhead is just
4k bytes, so you have enough memory left to run virtually any
application. When you select this option you are presented with the DOS
prompt from which you can issue all of the commands you could before you
started Arcus. To return to the current Arcus session you simply enter
EXIT at the DOS prompt.
¬<Windows>╪13832 ¬
|Developer's Notes|
Running Arcus in a Shell:
Free memory required = at least 384k
Calling convention = ARCUS.BAT (NOT ARCUS_.exe !!!)
Expanded memory = desirable but not essential (used for overlays)
Automatic file loading:
You can export text files from your application and execute Arcus with the
exported information already loaded and the starting position within Arcus
already defined.
Arcus files have the following structure:
Z%,"date of saving","description of contents"
"name of column 1", J1%
"name of column 2", J2%
dc1r1!
dc1r2!
dc1r3!
dc2r1!
dc2r2!
dc2r3!
Key:
Z% = number of variables (columns in worksheet, above it would be 2)
JX% = number of data (rows) in worksheet column X, as an integer
dcxry! = datum for column x, row y, as a single precision real number (thus
the data are read down and columns across the sheet from left to right)
The following is an actual Arcus worksheet file:
3,"29-03-1993","Arcus sample file"
"col 1 ",3
"col 2 ",3
"col 3 ",3
1
2
3
1
2
3
1
2
3
This file has the following structure:
number of variables, date, description of file
name of variable, number of data in variable
repeat for no of variables...
data read down each variable in turn...
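To make this concrete, here is a minimal sketch of a writer for this structure
in Python (illustrative only; the function name write_arcus_file is hypothetical
and the %g formatting only approximates single precision output):
def write_arcus_file(path, columns, date, description):
    # columns is a list of (name, values) pairs, for example
    # [("col 1", [1, 2, 3]), ("col 2", [1, 2, 3])]
    with open(path, "w") as f:
        # Z%,"date of saving","description of contents"
        f.write('%d,"%s","%s"\n' % (len(columns), date, description))
        # one header line per column: "name of column X", JX%
        for name, values in columns:
            f.write('"%s",%d\n' % (name, len(values)))
        # data are written down each column in turn
        for name, values in columns:
            for v in values:
                f.write("%g\n" % v)
# Reproduces the sample file shown above:
write_arcus_file("TEST",
                 [("col 1", [1, 2, 3]),
                  ("col 2", [1, 2, 3]),
                  ("col 3", [1, 2, 3])],
                 "29-03-1993", "Arcus sample file")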
To start an Arcus session with a file called TEST already loaded you would use
the command line ARCUS TEST or ARCUS.BAT TEST. Please note that the opening
credit screen is skipped if you opt for automatic file loading on start up.
The full command line options are: ARCUS /F$ /X% /R% /L$
Key:
F$ = file to load on starting
X% = code for starting locus: 9 = data management menu
                             12 = worksheet
                              1 = analysis menu
                              0 = main menu
R% = current printer row
L$ = log file name
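For example, assuming the options are given positionally in the order shown
above (only the plain ARCUS TEST form is documented here), the hypothetical
command line ARCUS TEST 12 would start Arcus with the file TEST loaded and the
worksheet displayed.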
If you have any questions then please do not hesitate to contact me:
Iain E. Buchan,
Medical Computing,
83, Turnpike Road,
Aughton,
West Lancs L39 3LD.
TEL UK (0)695 424 034
FAX UK (0)51 256 7001
The |Basics|
The Arcus user interface consists of plain text on a dark background. Menu
selections are text icons of keys which can be pressed to select those menu
items. Alternatively the cursor keys or a mouse can be used to move the
highlighted menu selection to the required item which is then selected by
pressing the enter key or the left mouse button. The menu system is a
branching structure. Moving backward to a previous menu is achieved by
pressing the escape key, selecting its icon or by pressing the right mouse
button. The mouse options will only work if a mouse is present, mouse driver
software is active and the mouse sensitivity has been defined in the setup menu.
The menu system is accompanied by a context sensitive hypertext help system.
Help screens are called up by pressing F1 or the middle mouse button (if you
have a three button mouse). Each help screen is relevant to the menu item
which is currently highlighted. A "Statistical Method Selection" section also
provides information within Arcus. This facility will attempt to find the best
test for your data but please remember that it is not a panacea of statistical
methodology (ref 2). If you have any doubt about the best method for your data
you should try to consult a statistician and you should most certainly consult
a reputable text book. This hypertext manual discusses the functionality of
Arcus Pro-Stat but gives only a brief outline of the statistical methods used.
For further statistical information I recommend that you seek out the references
listed as Core Texts in the ¬reference list╪310584 ¬. A list of good introductory texts
is also provided in the reference section.
¬Confidence intervals╪31897 ¬ (CI) are increasingly used in statistical inference.
Particular effort has been made to allow Arcus to address this valuable trend.
Wherever possible the most exact method for the CI has been used. Before
calculation of a CI a screen is displayed to enable you to select a coefficient
of confidence. Short-cut key strokes are given for the commonly used confidence
levels, for example pressing the enter key will set a 95% confidence level for
the calculation which follows. You are also given the opportunity to enter
your own confidence coefficient.
Some of the Arcus functions are time consuming. When a process is taking an
appreciable amount of time you are usually given a warning message. Please do
not assume that the program has "crashed", this is highly unlikely. The most
time consuming functions are the Lotus work file link, the calculation of exact
probability for the Mann-Whitney U statistic in the presence of tied data and
sample sizes for the comparison of means.
Hard copies of results from a printer are obtained by pressing the P key when
results screens are displayed. The ¬setup╪9267 ¬ menu and printer must be carefully
configured.
A flexible print routine, the Arcus screen editor, is invoked by pressing P or
E when results screens are displayed. This allows you to annotate a text screen
then send the results to a printer or to a log file on disk. The screen editor
accepts standard edit key combinations:
Ctrl+N Insert a line
Ctrl+Y Delete a line
Ctrl+P Embed a character
Ctrl+Page Up Move to top of text
Ctrl+Page Down Move to bottom of text
If you save your results to a log file then you have a text file of results
from the current Arcus session on disk. This text file can be examined and
printed subsequently using the log file editor listed in the data management
menu; it can also be imported by word processing software. The name of the log
file is composed of the day, the month and the number of the Arcus session on
that day (one log file per session), with the extension ".LOG" (for example,
1201_3.LOG for the third session on the twelfth of January).
Throughout Arcus the word variable is used to refer to a column of numbers in
the worksheet. These columns represent groups of data which can be investigated
via the analysis section. Any of the analyses which do not require columns of
data from the worksheet are listed under the Instant Functions section. This
section includes distribution functions and methods for contingency table
analysis.
|Essentials|
Arcus aims to provide a user-friendly interface to statistical methods. This
aim presents two major hurdles, the first is the ease of use of the software
itself and the second is the level of assumed knowledge of statistics. The
development of Arcus has focused on providing basic statistical methods in an
intuitively simple package. One could say that statistical software should not
be used by people who do not understand "statistics" and therefore justify a
high level of assumed knowledge in statistical software. We do, however, live
in the real world where people forget statistical principles learned in the past
but need to apply them to their research. If Arcus can facilitate appropriate
statistical design, analysis and inference by combining text and tools then the
Arcus project will have achieved its objectives.
If you are an experienced statistician then you will find useful functions in
Arcus which are absent or awkward to use in other statistical packages. It is
quicker to process data using Arcus on your desktop and only resort to SAS or
Genstat etc if you need a function which is not covered by the present version
of Arcus.
If you are an infrequent user of statistical methods then here is an approach
you might find useful. Consider your research as a sequence of actions:
planning, data collection, data preparation and description, further analysis
and presentation. You are the expert in the questions you are investigating
so you MUST think long and hard about these questions BEFORE you start the
research. Then consider how you can analyse any data you collect. Ask yourself,
will I be able to answer the questions I am asking or does my study leave itself
open to criticisms such as too many confounding variables? In this situation
you might need more control over your experimental conditions if this is
possible. Sample size estimation is a difficult area for the uninitiated;
Arcus provides sample size calculations but I would advise you to seek
statistical advice at this stage. A short time with a statistician at the
planning stage can save a lot of misdirected time and effort in the long run.
BEFORE you see the statistician you must have thought carefully about the
nature, collectability, controllability and appropriateness of the data you
plan to collect. If you go prepared you will get better answers faster.
There is a statistical method selection section within Arcus but it deals with
only the most basic statistical analyses. You are asked a series of questions
about your study and you are given the most appropriate hypothesis test to use
provided you are asking one of the simple questions covered by this section.
Remember that the most simple questions often provide the most powerful answers.
In some ways this function is an over-simplification and you MUST NOT rely upon
it for planning important studies. It is, however, useful for preparing
yourself before you see a statistician. It will get you thinking along the
right lines and thus make it easier for you to communicate your ideas to the
statistician.
Once you have a basic plan of action you can start to prepare your data for
entry into Arcus. You have three main options: 1) make a database, 2) put
data directly into the worksheet, 3) put data directly into non-worksheet
functions. The latter refers to simple situations such as the contingency
tables in the instant functions section of Arcus. More arduous number entry
tasks are made easier by using a keyboard with a number pad. Most Arcus users
will enter their data into the worksheet. This involves preparing columns of
numbers where each column represents a different group. For more information
please see ¬<Arcus Worksheet>╪36264 ¬. Please note that the help text for each analysis
function gives you information on how to prepare your data. Some users might
wish to make a database from which they can select information for export to
the Arcus worksheet. This is often the easiest approach to questionnaires.
For more information please see the ¬<Database Manager>╪59749 ¬.
The next stage is to look at your data. Are there any odd looking results and
if so, why are they odd? Then describe your data using ¬<Descriptive Statistics>╪80612 ¬.
If you are happy with the questions you were asking before you started the study
then go on to apply the hypothesis test which you planned at the outset. It
may be that there is no appropriate "test"; you should establish your analytical
plan at the start of the study, taking statistical advice if necessary. NEVER
sift through various tests trying to get p<0.05; this is not difficult to detect
and makes you look very unprofessional. If you do not understand why this is
so then please see ¬<p values>╪29175 ¬. The inferences you make from your statistical
analyses require knowledge of both the statistical principles used and the
biological relevance of the numerical conclusions.
The last step is presentation. You might have a well conducted and well analysed
study which falls down on presentation. Here are a few basic pointers: Present
raw data where possible, use graphs if they can show something important, do not
duplicate data (e.g. tables and text), do not present parametric and
non-parametric descriptive statistics together, use the asterisk rating system for
¬<p values>╪29175 ¬ and use ¬<confidence intervals>╪31897 ¬ in discussion.
Summary: Think long and hard about questions
± Try Arcus statistical method selection
± Try Arcus sample size calculations
± Consult a statistician
See the help text for the chosen Arcus functions
Analyse and save results to a log file and/or paper output
± Transfer data to a graphics package
Prepare report quoting Arcus version number and references
|Interacting with Arcus|
Arcus uses a plain text screen with a title bar at the top. The menus are lists
of keys which you can press to select a menu title if you do not have an easier
way of selecting menu items. This occurs with some portable computers where the
cursor keys are awkwardly placed. If you have a good keyboard then select menu
items using the cursor keys and the enter key. The escape key moves you back
a menu. If you have a mouse then move the highlighted menu selection using the
mouse and accept your selection by pressing the left mouse button. The right
mouse button moves you back a menu.
Within an Arcus menu you can access special functions using keys which are not
displayed on the screen:
F1 or Alt+H calls up help text that is relevant to the currently highlighted
menu title.
Alt+P or Alt+E in the help system or results screens invokes the Arcus screen
editor which can be used to annotate text screens then print them or save them
to the active log file.
Alt+N calls up the Arcus notepad on which you can jot down ideas and save them
to the active log file or to the printer.
If you are having problems with your mouse then please make sure that you are
using a standard mouse driver such as Microsoft's MOUSE.COM or MOUSE.SYS. You
do not have to use the mouse driver software which came with your mouse.
Microsoft's MOUSE.COM is supplied on Arcus installation disk one.
|P Values|
The p value or critical level is the smallest significance level at which the
null hypothesis (Ho) would be rejected; equivalently, it is the probability,
assuming Ho is true, of obtaining a result at least as extreme as the one
observed.
The null hypothesis is most often the hypothesis of "no difference" e.g. no
difference between mean blood pressure in group A and group B. This should have
been considered before the start of your study. If you expect results to be in
one direction only then you have a one tailed test. More often you can not be
certain that the results can go in one direction only; you must therefore use
a two tailed p value.
If your p value is less than the chosen significance level then you reject the
null hypothesis i.e. accept that your sample gives reasonable evidence of a
population difference for the parameters you have observed. It does NOT imply
a "meaningful" or "important" difference, that is for you to decide when
considering the biological relevance of your result.
The choice of significance level at which you reject the Ho is arbitrary.
Traditionally the 5%, 1% and 0.1% (p < 0.05, 0.01 and 0.001) regions have been
used. These numbers tend to give a false sense of security when in reality
there are many factors which contribute to the arbitrary nature of these levels.
In the ideal world we would be able to define a "perfectly" random sample, the
most appropriate test and one definitive conclusion. We simply can not. What
we can do is try to optimise all stages of our research to minimise sources of
uncertainty. When presenting p values it is good practice to use the asterisk
rating system:
p < 0.05 *
p < 0.01 **
p < 0.001 ***
Some authors quote statistically significant as p < 0.05 and statistically
highly significant as p < 0.001. The asterisk system conveys more information
and avoids the woolly term "significant".
At this point, a word about error. Type I error is the false rejection of the
null hypothesis and type II error is the false acceptance of the null
hypothesis. As an aide-memoire: think that our cynical society rejects before
it accepts.
The significance level (α) is the probability of type I error. The power of a
test is one minus the probability of type II error. Power should be maximised
when selecting statistical methods. If you want to estimate sample sizes then
you must understand all of the terms I have mentioned here.
You might be interested in further details of probability and sampling theory
at this point. There are a number of good ¬introductory texts╪310834 ¬.
You must understand ¬confidence intervals╪31897 ¬ if you intend to quote p values. You
are encouraged to quote confidence intervals by all good journals.
|Confidence Intervals|
A confidence interval (CI) for a population parameter is the interval in which
the unknown true population value for this parameter is assumed, with a certain
probability, to lie. This probability is arbitrary, 95% (0.95) is the most
commonly chosen value.
The parameter in question can be a mean, difference between two means, a
proportion etc. The CI included with each Arcus function is discussed in the
help text for that function. The interval is often symmetrical about the
parameter but this is not necessarily so. In some studies wider or narrower
confidence intervals will be required. This rather depends upon the nature
of your study. I would advise you to consult a statistician if you plan to
use "non-standard" CI's.
A word about terminology: You will hear the terms confidence interval and
confidence limit used. The confidence interval is the range Q-X to Q+Y where
Q is our parameter and Q-X is the lower confidence limit and Q+Y is the upper
confidence limit.
|Julian Numbers|
The Julian period began on January 1st 4713 BC (a date in the old Julian
calendar). The Julian number of a date represents the number of days since the
start of the Julian period. These numbers are a useful way of representing
dates because the arithmetic difference between two Julian numbers is the exact
number of days between the two dates they represent. The Gregorian calendar
which we use provides no year zero between 1 BC and 1 AD; projected back onto
it, Julian number 1 corresponds to the 25th of November 4714 BC. Please note
that you can use BC dates in the worksheet when it is in date mode but you can
not use BC dates in the Arcus database manager.
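As a sketch of the underlying arithmetic, the standard Gregorian date to
Julian number conversion can be written in Python as follows (illustrative
only; this is not necessarily the exact routine Arcus uses internally):
def julian_number(day, month, year):
    # Standard civil (Gregorian) calendar date to Julian day number.
    # For BC dates use astronomical years (1 BC = 0, 2 BC = -1).
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return (day + (153 * m + 2) // 5 + 365 * y
            + y // 4 - y // 100 + y // 400 - 32045)
# The difference between two Julian numbers is the exact number of days
# between the two dates, e.g. from 25/12/1993 to 1/1/2000:
print(julian_number(1, 1, 2000) - julian_number(25, 12, 1993))  # 2198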
|Statistical Method Selection|
This section provides a simple decision tree for selecting statistical methods
appropriate to your data. Please note that the advice given is only a rough
guide to methods appropriate to your investigation. Only the simpler
experimental designs are covered. If you require a fuller appreciation of the
statistical methods that are appropriate to your investigation then you are
strongly advised to consult a reputable text or a statistician. A common fault
is to read an article which is related to your work and repeat the methods that
have been used by the authors; do not assume that all journals weed out bad
statistical methods!
¬<Measurement Scales>╪34375 ¬
¬<Essentials>╪21747 ¬
¬<Analysis>╪77148 ¬
¬<Reference List>╪310584 ¬
|Measurement Scales|
Before you plan the statistical approach to your investigation you must
understand the nature of the variables you are studying. Different variables
have different mathematical characteristics which usually require different
types of analysis. Please familiarise yourself with the following measurement
scales:
INTERVAL
■ Scale with a fixed and defined interval eg temperature or time.
ORDINAL
■ Scale for ordering subjects from low to high with any ties attributed
to lack of measurement sensitivity eg. pain score from a questionnaire.
NOMINAL with order
■ Scale for grouping into categories with order eg. mild, moderate
or severe. This can be difficult to separate from ordinal.
NOMINAL without order
■ Scale for grouping into unique categories eg. blood group.
DICHOTOMOUS
■ As for NOMINAL without order but with two categories only eg. surgery / no surgery.
¬<Essentials>╪21747 ¬
¬<Reference List>╪310584 ¬
|Data Management|
¬<Arcus Worksheet>╪36264 ¬
¬<Worksheet files>╪44501 ¬
¬<ASCII & Lotus link>╪51843 ¬
¬<Log file editor>╪57897 ¬
Most Arcus analyses require data which have been prepared in rows and columns.
This section provides you with a worksheet with which to edit these data and
other functions which import / export data to / from the worksheet. There is
also a complete database management system which can be used to edit data in
"forms", this is often the easiest approach when processing questionnaire data.
If you need to process small numbers of data, such as contingency tables, then
you do not need to enter these data via the worksheet. All such functions,
which are listed in the analysis menu under "Instant Functions", ask you for
the data they require after you have selected the function. These data are
entered directly in response to instructions on the screen.
|Arcus Worksheet|
The Arcus worksheet can be thought of as a computerised sheet of paper which
holds numbers in rows and columns. This is, however, a rather advanced sheet
of paper with many editing functions and the ability to interpret formulae as
you enter them.
Superficially this worksheet resembles many of the well known spreadsheets
but there are some important differences. Unlike spreadsheets the Arcus
worksheet has been optimised for the preparation of data for statistical
analysis. It does not hold any character data apart from the column labels.
You may enter formulae in a cell (an individual element of a column) but these
formulae are immediately translated into their numeric results. If you want to
transform all of the data in a column by applying a formula to them then simply
press Alt+F. Likewise if you need to create a new column of data as a function
of one or more other columns then you can do so by pressing Alt+Q.
The cursor control keys have the following actions in the worksheet:
arrow right - go one cell to the right
arrow left - go one cell to the left
arrow up - go one cell up
arrow down - go one cell down
home - go to top of current column
ctrl + home - go to top of the first column
end - go to the last entry in the current column
ctrl + end - go to the top of the last column which contains data
Alt + G - go to the column name of your choice
Unlike most spreadsheets the Arcus Worksheet uses the mouse as a pure cursor
locator. There are no scroll bars to aim at; you simply move the cursor using
the mouse and the sheet will shift across if you move past the limit of the
screen. When you try to move the cursor beyond the limit of the sheet itself
you will see a red "LIMIT" sign flash at the cursor location. If you try to
move past the right hand limit of the worksheet then you will be asked whether
or not you wish to extend the worksheet by another column. If there is not
enough memory available to extend the worksheet in this way then the operation
is aborted with a beep. If you start a new sheet knowing that you require more
than the standard 50 columns then you can extend the worksheet to a specified
number of columns using the ¬Reset Parameters╪50354 ¬ selection of the data management
menu. The maximum number of columns per worksheet is 1,000 and the row limit
is 25,000. Please note that resetting the column limit to a small number
increases the maximum size of each column.
Numbers are entered in the worksheet by pressing any combination of
alphanumeric keys followed by the enter key. You can enter numbers or formulae
at the cell editing line. For example 8/SQR(16) would put the solution 2 into
that cell. These formulae are for instant interpretation only, you can not
embed them in a cell of the worksheet and you can not use other cell locators
(e.g. A1 for column 1 row 1) as used by most spreadsheets. The functions which
the cell editor can interpret are listed below and this information is available
in a help screen which is invoked by pressing the F1 key when you are editing
a cell.
Constants: PI
EE as e
ABS absolute value
CLOG common (base 10) logarithm
CEXP anti log (base 10)
EXP anti log (base e)
LOG natural (base e, Naperian) logarithm
SQR square root
! factorial (max 34)
LN! log factorial
IZ normal deviate for a p value
UZ upper tail p for a normal deviate
LZ lower tail p for a normal deviate
^ exponentiation (to the power of)
+ addition
- subtraction
* multiplication
/ division
\ integer division
ARCCOS arc cosine
ARCCOSH arc hyperbolic cosine
ARCCOT arc cotangent
ARCCOTH arc hyperbolic cotangent
ARCCSC arc cosecant
ARCCSCH arc hyperbolic cosecant
ARCTANH arc hyperbolic tangent
ARCSEC arc secant
ARCSECH arc hyperbolic secant
ARCSIN arc sine
ARCSINH arc hyperbolic sine
ATN arc tangent
COS cosine
COT cotangent
COTH hyperbolic cotangent
CSC cosecant
CSCH hyperbolic cosecant
SINH hyperbolic sine
SECH hyperbolic secant
SEC secant
TAN tangent
TANH hyperbolic tangent
AND logical AND
NOT logical NOT
OR logical OR
< less than
= equal to
> greater than
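For example, assuming the usual operator precedence, entering any of the
following at the cell editing line stores the value shown:
8/SQR(16)   gives 2
CLOG(1000)  gives 3
5!          gives 120
2^10        gives 1024
7\2         gives 3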
If you enter a cell in a column which has empty cells above the current
location then the gaps above are automatically filled with missing data values.
The worksheet editing mode is indicated by a "Norm" or "Date" sign at the top
left hand corner. The date editing mode allows you to enter conventional dates
in the European day/month/year format. These entries are stored as Julian
integers in the worksheet but the highlighted cursor location always shows the
conventional date interpretation of the Julian number. Please note that the
difference between two Julian numbers is the exact number of days between the
two dates from which these numbers are derived.
The saving and loading of worksheet data to/from disk takes place outside the
worksheet itself. You will see the relevant functions listed in the data
management menu under ¬Worksheet Files╪44501 ¬.
Labelling of columns is achieved using the key combination Alt+N or Alt+L.
Other special keys which are active in the worksheet are:
F1 help screen
Alt+I insert a cell at the current cursor location
Alt+C insert a column at the current cursor location
Alt+D delete the cell at the current cursor location
Del delete the cell at the current cursor location
Alt+X delete the current column
Alt+Z delete the current row
Alt+N enter or edit a column name
{When you are editing column names you can press TAB / Shift+TAB
to move directly to the next / previous column name.}
Alt+B copy a block from the current column to another column
Alt+T toggle between normal and date editing mode
Alt+P print all rows of selected columns
Alt+S display current column statistics
Alt+G go to a selected column
Alt+F apply a formula to the current column
Alt+R put ¬random numbers╪254271 ¬ into the current column
Alt+Q make a new column as a function of other columns
Space bar enter a missing data value (3.456789E+33 displayed as *)
As in most of Arcus the mouse buttons emulate the enter and escape keys. Thus
the right mouse button (Esc) exits the worksheet and the left mouse button
(Enter) accepts any data you have typed at the current cell then moves down a
cell. Some spreadsheets move the cursor to the right when you press enter but
Arcus moves down. This is quite deliberate as most people prefer to enter
numeric data in columns not rows.
A word about indicator variables. Arcus uses indicator variables for survival
analysis. All other functions require you to provide data from different
groups in different columns. Some stats packages such as SAS use a column of
1's, 2's etc to indicate which group the entry in that row of the data column
belongs to. This is the indicator variable system which Arcus uses for
survival analysis. All other functions ask you for a separate column of data
for each group.
Arcus uses 3.456789E+33 as its missing data value and in all instances this is
displayed as an asterisk (*). This is an internal constant which you do not
need to remember; a cell within the worksheet is marked as a missing
observation by pressing the space bar. In subsequent calculations these values
are skipped and all values in a row containing a missing data value are skipped
if the variables are grouped, e.g. matched pairs.
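A sketch of this rule in Python (illustrative only, not Arcus code):
MISSING = 3.456789e33
x = [1.2, MISSING, 3.1]
y = [2.0, 2.5, MISSING]
# ungrouped: missing values are simply skipped within a column
x_used = [v for v in x if v != MISSING]            # [1.2, 3.1]
# grouped (e.g. matched pairs): the whole row is skipped if any value
# in that row is missing
pairs = [(a, b) for a, b in zip(x, y)
         if a != MISSING and b != MISSING]         # [(1.2, 2.0)]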
|Worksheet Files|
This section enables you to retrieve worksheet data which have been stored on
disk using the ¬Save Worksheet╪47791 ¬ function of this menu.
The standard location for Arcus worksheet files is a sub-directory called \DATA\
off your Arcus directory. If the standard setup has been used for an Arcus
installation on drive C then the full data file path is C:\ARCUS\DATA. Arcus
worksheet files do not use any special extension (the letters after the point
in the file name). You can use any naming system you want. These worksheet
files also have a very simple structure, they are stored in ASCII text. This
simple structure has the benefit of enabling other developers to read and write
Arcus worksheet files easily. This allows other applications, such as custom
databases, to select data for export then write them into an Arcus file.
If you are a developer then please see ¬Developer's Notes╪15349 ¬ for more information
about the Arcus worksheet file structure.
You can load more than one worksheet file from disk into the current worksheet.
This enables very large worksheets to be created from a number of smaller ones.
The process is ultimately limited by the column limit of 1,000 or the amount of
memory your computer has free.
If you want to change the standard data file location then please see ¬Setup╪9267 ¬.
|Arcus File Finder|
Arcus uses the following protocol to search through disks for files. You are
shown a list of titles which you can select using the cursor keys and enter key
or by using the mouse. Disk drives, directories, sub-directories and files are
displayed differently:
[-A-] <----this moves you to drive A
[-B-]
[\ARCUS] <----this moves you to directory \ARCUS
[\DOS]
IO.SYS
AUTOEXEC.BAT
CONFIG.SYS
If we select [\ARCUS] then [\DATA] you might see:
[..] <----this moves you back to the directory \ARCUS
MYDATA
RAT1
SURVEY2 <----this selects the file SURVEY2
Please note that you can jump to files beginning with a certain letter by
pressing that letter on the keyboard when the file list is displayed.
|Arcus Data File Path|
This function enables you to select a worksheet file which has been stored in
the standard Arcus data file location. If you installed Arcus on drive C
using the default paths then this location will be C:\ARCUS\DATA.
The files are presented to you in alphabetical order. If there are many files
to sift through then press the first letter of the file name you are looking
for. This causes the selection bar to jump to files beginning with that letter.
The mouse can also be used to select files. The left hand mouse button or the
enter key makes the selection. The Esc key or the right hand mouse button
quits the file selector without loading a file.
¬<Arcus File Finder>╪45889 ¬
¬<Data File Path>╪10092 ¬
|Select Path|
This function enables you to bypass the standard Arcus data file location and
specify your own path to a worksheet file. This situation might arise when you
have a particular file on floppy disk. To examine the contents of a file in
drive A just enter the path A:\.
¬<Arcus File Finder>╪45889 ¬
¬<Data File Path>╪10092 ¬
|Save Worksheet|
This function enables you to save all of the data in the worksheet to a file on
disk. You are asked to specify the name of this file. No special extensions
are added to this name and you can use your own extension if you wish. Try to
adopt a simple naming system which you can recognise easily. Please note that
you are presented with file names listed in alphabetical order when you come to
recall worksheet files from disk.
The location for storage of worksheet files is also under your control. Arcus
prompts you with the standard data storage path defined when you installed Arcus
e.g. C:\ARCUS\DATA. If this is acceptable then just press the enter key. If
you wish to divert this file, say to a floppy disk, then type in the relevant
path e.g. A:\ for the A drive. If you want to change the standard data storage
path then you can do so via ¬setup╪9267 ¬.
Arcus saves its worksheet files using a very simple text file structure. This
allows software developers to read and write Arcus data files easily. If you
are a developer then please refer to ¬developer's notes╪15349 ¬.
A full description of each worksheet, up to 150 characters, can be added to
each file. You are prompted for this; just press enter if it is not required.
If you change your worksheet and forget to save it then you will be prompted to
do so on finishing the current Arcus session.
|Save Rotated Worksheet|
This is a special function for those who wish to rotate an Arcus worksheet.
Example:
1.. 2.. 3..
1 1.1 0.7 1
2 1.5 0.6 2
3 1.6 0.6 3
4 1.8 0.5 4
...this would become:
1.. 2.. 3.. 4..
1 1.1 1.5 1.6 1.8
2 0.7 0.6 0.6 0.5
3 1 2 3 4
... in other words rows become columns and columns become rows.
The file extension ".ROT" is appended to your file name. Column names are lost.
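The rotation itself is a simple transposition; a sketch in Python
(illustrative only, not Arcus code):
rows = [[1.1, 0.7, 1],
        [1.5, 0.6, 2],
        [1.6, 0.6, 3],
        [1.8, 0.5, 4]]
rotated = [list(column) for column in zip(*rows)]
# rotated == [[1.1, 1.5, 1.6, 1.8],
#             [0.7, 0.6, 0.6, 0.5],
#             [1, 2, 3, 4]]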
|Current Status|
This function simply displays information concerning the current worksheet and
the free memory state of your computer. The latter represents the number of
kilobytes of memory which Arcus can use for data storage and processing.
The time and date displays depend upon you having set these parameters properly.
To change your computer's time or date, just shell out to DOS and enter them
using TIME and DATE commands. Note that times are entered as 14:30:00 and dates
are entered as 09-12-93. If your computer is not maintaining times and dates
then its backup battery is probably flat.
|Reset Parameters|
This section provides you with the ability to wipe clean the current worksheet.
For this reason you must be careful with these functions!
"New Worksheet (50 columns)"
This first selection simply wipes the worksheet leaving an empty 50 column
sheet.
"New Worksheet (user defined columns)"
This second selection wipes the current worksheet and you select the column
limit for the new worksheet. There are two main reasons for setting the
column limit. The first is when you know that you will need more than 50
columns and you do not want to be prompted to extend the sheet each time you
try to pass the column limit. Secondly you might need a very long column
length on a computer with limited memory. To maximise column length in this
situation you must select a small column limit. The absolute maxima are 1,000
columns and 25,000 rows.
"Reset Printer"
This selection enables you to reset the printer line counter. If you have a
Laserjet or PostScript printer then this automatically resets the printer
as well as the line counter within Arcus. If you have any other line printer
then you will need to align new paper to the top row before you continue. The
function basically tells Arcus that you are starting again at the top line.
The next page break will happen when the page length is exceeded. If you need
more information about setting up your printer for Arcus then please refer to
the ¬setup╪9267 ¬ section.
|ASCII & Lotus link|
¬<Plain ASCII file import>╪54064 ¬
¬<Formatted ASCII file import>╪55002 ¬
¬<Lotus compatible ASCII file export>╪55544 ¬
¬<Lotus compatible WK? file import>╪52500 ¬
This section deals with the transfer of data between Arcus and other
applications. Specifically, the import of data from ASCII text files and Lotus
compatible spreadsheets and the export of data to spreadsheets. Please note
that data can also be imported from database files using the ¬Database Manager╪59749 ¬.
If you are a developer wishing to read and write Arcus worksheet files then
please see ¬Developer's Notes╪15349 ¬.
|Lotus Compatible WK? File Import|
Arcus can read binary spreadsheet files which are compatible with Lotus 123
WKS or WK1 files. Applications such as Quattro, Excel and Symphony can export
these files providing you specify the correct file format. Borland's Quattro
automatically produces Lotus compatible files when you save a worksheet with
the .WKS or .WK1 file extension (do not use .WKQ).
One proviso is that you must use column labels in your original spreadsheet.
Arcus uses column labels to identify where columns begin. Once the spreadsheet
file has been read by Arcus you are given a list of column labels which have
been found. You then simply select the columns you wish to bring across as
Arcus worksheet columns. The label each column had in the spreadsheet is
maintained in the Arcus worksheet. Things can get a bit slow with large
spreadsheets, therefore it is better to have the spreadsheet (WK?) file on
hard disk, not on floppy disk.
Gaps within a spreadsheet column are interpreted as missing data. Gaps at the
end of a spreadsheet column are not interpreted unless you enter a missing data
value (3.456789E33) at the end of the column. The column label must be no more
than one gap away from the top of the column of numeric data. If you need a
larger gap at the top of a column then you must enter the Arcus missing data
value (3.456789E33) at this position in the spreadsheet.
Please note that all columns are transferred individually and are appended to
the current worksheet if you have data there already.
|Plain ASCII File Import|
Plain ASCII file describes a simple text file which does not use any special
characters or codes for formatting. Such a file might be produced by a database
report generator or a simple text processor. This Arcus function enables you
to pick out columns of numbers from such a file and load them into the current
worksheet.
Please use only plain text in these files, tabs and other formatting characters
make it difficult to define columns.
You pick out columns of numbers by selecting start, width and end points on the
screen. For this purpose Arcus displays the text file on screen. If your file
is greater than 80 columns then you are asked to define which horizontal
section of the file you wish to search. Gaps or non-numeric text are treated
as missing data.
Importing data in this way can be quite irksome; therefore you should
consider other methods for frequent imports.
|Formatted ASCII File Import|
Some applications output data in text files which use spaces or commas to
delimit data. One such application is FigP.
Consider the file:
1.2,1.3,8
1.5,1,8
1.7,1.0,9
1.7,1.5,10
..this would import into an Arcus worksheet as:
1... 2... 3...
1 1.2 1.3 8
2 1.5 1 8
3 1.7 1 9
4 1.7 1.5 10
NB Do NOT use spaces AND commas to separate your data, use EITHER commas OR
spaces! Do NOT use column titles in the text file.
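A minimal sketch of this kind of import in Python (illustrative only, not the
routine Arcus uses; the treatment of non-numeric fields as missing data
follows the worksheet convention described earlier):
def read_delimited(path, missing=3.456789e33):
    # Split each line on commas or runs of spaces; any field that does
    # not parse as a number becomes the Arcus missing data value.
    rows = []
    with open(path) as f:
        for line in f:
            row = []
            for field in line.replace(",", " ").split():
                try:
                    row.append(float(field))
                except ValueError:
                    row.append(missing)
            rows.append(row)
    return rows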
|Lotus Compatible ASCII File Export|
All good spreadsheets can read comma and quote delimited text files. Column
titles are contained within quotes and numeric data are separated by commas.
Consider the text file:
"Age","Urea","Creatinine"
65,6.5,101
23,3.4,65
44,4,80
..this would export to a spreadsheet as:
Age Urea Creatinine
1 65 6.5 101
2 23 3.4 65
3 44 4 80
Arcus does not export WK1, WKS, WKQ or any other binary spreadsheet files
because there is no point when all good spreadsheets can read these simple
portable comma and quote delimited text files.
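Producing such a file is straightforward; a sketch in Python (illustrative
only; the file name KIDNEY.TXT is hypothetical):
def export_lotus_ascii(path, titles, rows):
    # Column titles within quotes, numeric data separated by commas.
    with open(path, "w") as f:
        f.write(",".join('"%s"' % t for t in titles) + "\n")
        for row in rows:
            f.write(",".join("%g" % v for v in row) + "\n")
export_lotus_ascii("KIDNEY.TXT", ["Age", "Urea", "Creatinine"],
                   [[65, 6.5, 101], [23, 3.4, 65], [44, 4, 80]])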
|Select Data|
This function enables you to select data from a worksheet column which meet
certain criteria that you define. It also enables you to pick out selected
data and change them. There are two basic uses of this function which we
shall look at by example:
1. Aim:
To select all patients over 65 and their serum creatinines.
Source:
A column of ages and a column of creatinines from a group of 100 patients.
Action:
a. Select from column AGE.
Match from column CREATININE.
Expression is >65.
Choose "create new variable".
b. Select from column AGE.
Match from column AGE.
Expression is >65.
Choose "create new variable".
Result:
Two new columns have been appended to the worksheet, one with ages over 65
and another with creatinine values for all the over 65's which match the
ages in the other new column.
2. Aim:
To replace certain values in a worksheet column. You might need this if
you have imported data from an application which uses a different missing
data value to Arcus.
Source:
Any column with unwanted data.
Action:
Select from this column.
Choose "replace values".
Specify the value to replace (eg -999).
Specify the value to replace it with (eg 3.456789E33 the Arcus missing data
value).
Result:
All -999's become 3.456789E33 (* in the Arcus worksheet ie missing data).
Please note that the Arcus Database Manager can also be used to select out data
before you import it to the Arcus worksheet. For more information on this
please see ¬Record Selection╪69953 ¬.
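The select and match behaviour of example 1 can be pictured as follows (a
sketch in Python with hypothetical values, not Arcus code):
age        = [70, 62, 81, 59, 66]
creatinine = [101, 80, 150, 65, 110]
# Select from column AGE, match from column CREATININE, expression >65:
new_creatinine = [c for a, c in zip(age, creatinine) if a > 65]  # [101, 150, 110]
# Select from column AGE, match from column AGE, expression >65:
new_age = [a for a in age if a > 65]                             # [70, 81, 66]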
|Log File Editor|
If you use the Arcus screen editor (invoked by pressing P or E) and choose the
"save to log file" option (F2) then you will have a log file for that Arcus
session saved in the Arcus data sub-directory. Each new Arcus session uses a
separate log file name, this is composed of the day, the month and the number
of the Arcus session on that day, i.e. 1201_3.LOG would be the log file from
the third Arcus session in which a log file was used on the twelfth of January.
This function provides a simple text editor with which you can examine and edit
the content of any text file. It also enables you to send this text to a
printer. If you require more powerful editing functions then please use your
familiar word processor. Note that you can run your word processor within Arcus
by shelling out to DOS; there is no need to finish your current Arcus session.
The cursor location in the Arcus log file editor can be controlled using the
cursor keys or the mouse and the left mouse button. The right mouse button and
the Esc key quit the editor. The editor accepts standard key combinations:
Ctrl+N Insert a line
Ctrl+Y Delete a line
Ctrl+P Embed a character
Ctrl+Page Up Move to top of text
Ctrl+Page Down Move to bottom of text
If you want to enter a character which is not represented on your keyboard then
you can do so by holding down the left Alt key whilst tapping out the ASCII code
of that character on the right hand number pad (if present). For example, the
code Alt + 224 gives the letter alpha. A list of these decimal codes is given
under ¬<ASCII Codes>╪294887 ¬.
If you intend to import Arcus log files into word processing software, try to
specify small font sizes so that you avoid unwanted wrapping of lines.
Arcus |Database Manager|
This provides a facility for creating and maintaining databases which are file
compatible with dBase III plus, dBase IV, dBXL/Quicksilver, FoxPro, FoxBase or
Clipper. It also enables you to import database fields as Arcus variables.
Help prompts are provided in addition to the hypertext help which is invoked
by pressing the F1 key. Help menus are also available via the F1 key within
most functions. If you have even a vague idea of how databases work then you
will find this part of Arcus Pro-Stat intuitively simple. If you are not
familiar with database management systems then you may wish to read
"¬Data-Basics╪74457 ¬".
One notation convention which you should be aware of is the caret sign ^
followed by a key; this indicates that a combination of the Ctrl key plus that
key should be pressed (i.e. ^Home is Ctrl + Home).
For information about supported file structures, limits, record selection and
other technical data please refer to the ¬Database Technical Information╪66897 ¬
section.
If you need to maintain complex multi-relational databases with elaborate
reporting systems then you should select one of the dedicated database
management systems and use the Arcus Database Manager as a link between this
and the Arcus Worksheet. Please make sure that your database manager can
export files which are readable by Arcus. Most database managers can export
files in different formats. The database file formats which Arcus can read are
dBASE III, dBASE IV, FoxPro, FoxBase, dBXL/Quicksilver and Clipper.
|Open Database File|
The first step in using this database manager is to open, or to create and
then open, a database file. Arcus searches for files with the DBF extension
and displays
summary information about each compatible database file in the chosen sub-
directory. The database file types which can be handled by this database
manager are dBASE III, dBASE IV, FoxPro, FoxBase, dBXL/Quicksilver and Clipper.
¬<Create New Database File>╪64240 ¬
¬<Arcus File Finder>╪45889 ¬
|Open Index File|
This function allows you to open an index file which has been made for the
database file which is open. Arcus searches for index files with the NDX
extension and displays summary information about each compatible index file in
the chosen sub-directory. The index file types which can be handled by this
database manager are dBASE III, dBASE IV, FoxPro, FoxBase, dBXL/Quicksilver
and Clipper.
¬<Index or Re-Index Database File>╪65035 ¬
¬<Arcus File Finder>╪45889 ¬
|Copy Data to Another File|
This function enables you to make new database files or new Arcus Worksheet
files from records in the active database file. Both of these links allow you
to be selective in the choice of records and the fields from each of these
records which you copy to the new file. Please note that field names will be
transferred to Arcus data files as worksheet column (variable) labels.
|Browse and Edit Database|
The browse & edit option presents your database in a worksheet format with
fields as columns and records as rows. You can use this option to inspect
and edit the existing records of the active database file. Please note that
an index file for the active database will be updated when you edit records
provided it has been opened; any other inactive index files on disk will not
be updated. If you have several widely spaced fields to edit then you should
use the rearrange fields option to collect these fields onto one screen before
editing. If you wish to replace or remove records then please use the delete
marker in the browse & edit function followed by the remove deleted records
function within the pack/purge records option. If you wish to add new records
then please use the append records option.
|Append New Records|
This function enables you to load data into the active database file. The enter
key is used to confirm the input for a particular field but you must use F3 to
accept the entire record and move on to the next. Familiarity with the function
keys will facilitate easy use of this function.
Please note that the date entry format is DD/MM/YY(YY) but a date is stored as
YYYYMMDD in the database file. It is the YYYYMMDD format which is displayed in
the "browse & edit function". If you use the "copy data to another file"
function then all dates will be translated into Julian numbers.
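As a sketch of these conversions in Python (the YYYYMMDD storage format is as
documented above; the standard Julian Day Number is shown as an assumption,
since the exact Julian numbering used by Arcus is not specified here):
  from datetime import date

  def to_storage(d: date) -> str:
      # Dates are entered as DD/MM/YY(YY) but stored as YYYYMMDD.
      return f"{d.year:04d}{d.month:02d}{d.day:02d}"

  def julian_day(d: date) -> int:
      # Standard Julian Day Number; Arcus's own Julian numbering may
      # use a different epoch.
      return d.toordinal() + 1721425

  print(to_storage(date(1993, 12, 25)))   # 19931225
  print(julian_day(date(1993, 12, 25)))   # 2449347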
|Create New Database File|
The first stage in making a new database file is to create a template using this
function. This template defines how your data will be stored in the database
file on disk. If you specify very large fields then the database manager will
allocate more disk space per record. This can lead to a great deal of disk
space being wasted; please consider this when defining fields. Arcus
supports dBASE III, dBASE IV, FoxPro, Clipper, FoxBase and dBXL/QuickSilver
database file formats. Some formats permit larger field sizes and/or number
of fields per record, please see "¬Database Technical Information╪66897 ¬" for more
details about this.
N.B. To make a new database active you must next select it via the "Open
Database File" option.
|Index or Re-Index Database File|
You can use index files with all Arcus database files. Index files define how
you look at the records in your active database file. One database file may
have many index files so that you can look at records in different orders or
selectively. For example, if you had an age field in your database file you
could use an index file based on all records to display them in age order.
You could also select only those records falling within a certain age range.
Please note that you must renew the index file each time you edit the parent
database file; this is done via the create index file option. Once you have
created an index file you must use the "select index file" function in order
to make it active.
|Modify Database Structure|
This function enables you to add, remove, shorten, lengthen or rename the
fields of an existing database file. It then refills the redefined file
structure with the records from the original database file. Field data is
truncated or padded as necessary. Please be careful not to lose data by
imprudent use of this function. It is wise to make a copy of your original
database file via the "copy data..." option before experimenting with this
option. Arcus does, however, make a backup file (name.bak) of your original
database file (name.dbf) when performing this function.
|Pack or Purge Records|
This function will compact an existing database file so that it takes up less
disk space and can be read more efficiently. The purge procedure removes all
records which have been marked as deleted so please use it with caution.
|Print Report|
This option enables you to send selected database fields and records to a printer.
The target printer port and the number of lines per page are defined via the
setup menu in the main Arcus module.
|Database Technical Information|
¬<Record Selection Functions>╪69953 ¬
Limits
~~~~~~
Maximum file size = 4.2 billion bytes
Record limits depend upon the file type selected:
--> dBASE III/III+
max record length: 4095
max no of fields: 128
field types: character 1-254
numeric 1-19 (0 to 15 decimal places)
logical 1
date 8
memo 10
--> dBASE IV
max record length: 4000
max no of fields: 255
field types: character 1-254
numeric 1-20 (0 to field length-2 decimal places)
floating 1-20 (0 to field length-2 decimal places)
logical 1
date 8
memo 10
--> FoxPro 1.0/2.0
max record length: 4000
max no of fields: 255
field types: character 1-254
numeric 1-20 (0 to field length-2 decimal places)
floating 1-20 (0 to field length-2 decimal places)
logical 1
date 8
memo 10
--> Clipper '87 5.0
max record length: 8192
max no of fields: 1023
field types: character 1-2048
numeric 1-30 (0 to 13 decimal places)
logical 1
date 8
memo 10
--> dBXL, QuickSilver
max record length: 4000
max no of fields: 512
field types: character 1-254
numeric 1-19 (0 to 15 decimal places)
logical 1
date 8
memo 10
--> FoxBase 1.0/2.0
max record length: 4000
max no of fields: 128
field types: character 1-254
numeric 1-19 (0 to 15 decimal places)
logical 1
date 8
memo 10
Date entry
~~~~~~~~~~
Please note that ArcusDB stores date fields in the format YYYYMMDD without any
separators. This format is used in the browse & edit and print report sections.
The append records section, however, uses the DD/MM/YY(YY) format to accept
initial input of dates; the Arcus worksheet uses this date entry system also. ArcusDB
does NOT convert dates to Julian numbers for the database files but does convert
them to Julian numbers when you export them as an Arcus worksheet file as this
is the date storage format in the Arcus worksheet.
Using Arcus Database Manager Independently
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The database module is designed to be called from the Arcus data management
menu but it can be run without the main Arcus module. If you wish to run it
independently then you must supply some parameters at the command line:
ARCUSDB /?a/0/?b/?c/0/?d/ where ?a is the data storage path
(e.g. C:\ARCUS\DATA\ - NB do NOT forget the final backslash), ?b is the
printer port (e.g. 1), ?c is the number of lines which your printer fits on one
page (e.g. 64) and ?d is the mouse sensitivity (e.g. 30). This can be put into
a batch file. Further details are available on request.
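Using the example values given above, the complete line (which could be placed
in a batch file) would read as follows; this is an illustration only, assuming
the parameters are substituted directly into the template:
  ARCUSDB /C:\ARCUS\DATA\/0/1/64/0/30/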
|Record Selection Functions|
Where S = string, N = Numeric, L = logical, D = Date:
ABS(N) absolute value
{ABS(5-11) is 6 not -6}
ASC(S) ASCII value of first character
{ASC("Abacus") is 65}
AT(S1, S2) character position of S2 within S1
{AT("Hello World", "or") is 8}
CAPS(S) capitalise the first letter of each word
{CAPS("GOOD DAY") is "Good Day"}
CHR(N) ASCII character
{CHR(65) is "A"}
DATE$ current system date
DELETED() returns "T" if record is deleted
(i.e. asterisk as first character) and "F" if it is not
IIF(X1, X2, X3) returns X2 if X1 is true else returns X3
{IIF(AGE>=65, "Ger", "Med") is "Ger" for age=78}
INSTR(N, S1, S2) character position of S2 within S1 starting at N
{INSTR(5, "Hello World", "l") is 10}
INT(N) rounded down to the nearest integer
{INT(3.5) is 3}
JULIAN(D) returns the Julian number of the date;
this number is in character format
LEFT(S, N) left N characters of S
{LEFT("Pioneer", 2) is "Pi"}
LEN(S) length of S
{LEN("Pioneer") is 7}
LOWER(S) lower case
{LOWER("HELLO") is hello}
LPAD(S1, N, S2) pad S1 to N characters with S2 at the left
{LPAD("Hello", 12, "H") is "HHHHHHHHello"}
LTRIM(S) cut leading blanks
{LTRIM(" Here ") is "Here "}
MAX(N1, N2) maximum of N1 and N2
{MAX(21, 21.01) is 21.01}
MID$(S, N1, N2) extract N2 characters from S starting at N1
{MID$("Hello",2,1) is "e"}
MIN(N1, N2) minimum of N1 and N2
{MIN(21, 21.01) is 21}
RECNO() current record number
RECORD() full content of current record in one string
REPLICATE(S, N) N replicates of S
{REPLICATE(".", 3) is "..."}
RIGHT(S, N) right N characters of S
{RIGHT("Pioneer", 2) is "er"}
RPAD(S1, N, S2) pad S1 to N characters with S2 at the right
{LPAD("Hello", 12, "*") is "Hello*******"}
RTRIM(S) cut trailing blanks
{RTRIM(" Here ") is " Here"}
SPACE(N) N blanks
{SPACE(3) is " "}
STRING$(N, S/N) N repetitions of S or ASCII(N)
{STRING$(3, 88) is "XXX"}
STR(N1, N2, (N3)) string of N1 of length N2 with N3 decimal places
{STR(2.341, 6, 4) is 2.3410}
SUBSTR(S,N1,N2) extract N2 characters from S starting at N1
{SUBSTR("Hello",2,2) is "el"}
TIME() eight character string of current time
TRIM(S) strip leading and trailing blanks
{TRIM(" Here ") is "Here"}
UPPER(S) convert to upper case
{UPPER("Hello") is "HELLO"}
VAL(S) numerical value of string
{VAL("34") is 34.0}
Record Selection Operators
~~~~~~~~~~~~~~~~~~~~~~~~~~
= equal to
<> not equal to
< less than
> greater than
>= greater than or equal to
<= less than or equal to
Boolean Operators
~~~~~~~~~~~~~~~~~
.AND. true if both expressions are true
.OR. true if one expression is true
.NOT. opposite truth of expression
Concatenation Symbols
~~~~~~~~~~~~~~~~~~~~~
+ combine expression
- subtract expression
Literals
~~~~~~~~
[] explicit expression not a field name
Examples
~~~~~~~~
.NOT. PAID
- gives records where the logical field paid is false (F/N)
PAID .AND. AGE < 30
- gives records where the logical field paid is true (T/Y) for ages under 30
AGE >= 25 .AND. AGE < 30
- gives records with ages from 25 up to (but not including) 30 where age is a numeric field.
VAL(AGE) >= 25 .AND. VAL(AGE) < 30
- gives records with ages from 25 up to (but not including) 30 where age is a character field.
MID$(UPPER(NAME), 1, 1) = "A"
- gives records where name begins with the letter A.
ASC(MID$(UPPER(NAME), 1, 1)) >= 65 .AND. ASC(MID$(UPPER(NAME), 1, 1)) < 73
- gives records with names from A to H (see Appendix Three for ASCII codes).
|DATA-BASICS|
Think of a database as a section of a filing cabinet. The database manager
enables you to create a special form for that section and to control the data
which are contained in each form. A form contains one record. A record
contains pieces of information, such as name, age, sex etc., in separate boxes
called fields. The type of field depends upon the type of data it has been
designed to accept, e.g. 10 characters or a number with 3 decimal places. All
of this field information is defined when you create a new database file. The
resulting template is then used to admit information to successive records.
Arcus allows you to change this basic structure even after you have put
information into the database file.
The term "report" refers to information taken from the records in the database
file for inspection on screen or print-out. This information consists of the
fields and records which you specify. That brings us to another important term
"record selection". Arcus uses the dBASE language to define your conditions for
selecting the records which you want to look at. For example, you might want to
consider only those aged 65 or over. In this case you enter the selection term
as AGE >= 65 providing you have a field called AGE. These selection expressions
can be highly complex, for more details see ¬<record selection functions>╪69953 ¬.
You might have heard the term "relational database". This refers to the way in
which several sections of our filing cabinet communicate or "relate". Say we
had basic patient details in one section, information from a study in another,
and a link between the two. This link is a special field, such as case sheet
number, which is common to both sections/databases. The current Arcus database
manager does not provide relational operation. If you need this facility then
you should use a dedicated database management system and use Arcus database
manager as a link between this and the Arcus worksheet.
The last term we shall examine here is "index". An index is a file which keeps
track of the records in your database file. It enables you to specify an order
in which you wish to work with database records. This order refers to one field
e.g. surnames in alphabetical order. One database file can have many index
files so that you can look at the same database in different ways. If you alter
a database file in any way without having an index file open then the index
will have lost track of your database. You must, therefore, re-index the
database if you think changes have been made to the database file without
having the index file open.
|ANALYSIS|
¬<Worksheet oriented analysis>╪77578 ¬
¬<Instant functions>╪213040 ¬
The analytical functions of Arcus are divided into the two sections shown above.
Worksheet oriented functions require data which have been prepared using the
worksheet in the data management section of Arcus. Instant functions prompt
you for data when you select a function, e.g. a box to fill in a four fold
contingency table.
|Worksheet oriented analysis|
¬<Arithmetical Manipulation>╪78201 ¬
¬<Descriptive Statistics>╪80612 ¬
¬<Pictorial Statistics>╪81471 ¬
¬<Parametric Methods>╪87475 ¬
¬<Nonparametric Methods>╪98877 ¬
¬<Regression and Correlation>╪119789 ¬
¬<Analysis of Variance>╪158578 ¬
¬<Survival Analysis>╪182274 ¬
All of the analysis functions which do not require their data to have been
entered in the Arcus worksheet are described under "¬Instant Functions╪213040 ¬". Those
functions which do require previously entered data from the Arcus worksheet
are dealt with in this section.
|Arithmetical Manipulation|
This provides a selection of arithmetical treatments which can be applied to a
worksheet column (variable). For example you could apply the expression
V1 * (V1/SQR(V1)+2) to the variable V1 and the result of this equation for
each of the data in the V1 variable would be placed in a new variable. The
results are always stored in a new variable which you name, the data in the
source variable are never altered. Arcus Pro-Stat can interpret a wide range
of functions; these are identical to the cell editor functions which are
described in the ¬Arcus worksheet╪36264 ¬ section of this hypertext. You apply the
expression by entering it via the keyboard and you can call up a list of
allowable functions by pressing the F1 key when you are editing the expression.
There is also provision for you to create a new variable as a function of more
than one existing Arcus variable; V1, V2, V3 etc. For example, if you wanted
to create a column of electrical current values using Ohm's law (V = I * R)
you could select resistance as V1 and voltage as V2 then apply the expression
V2 / V1.
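As a sketch of this idea (illustrative Python, not Arcus code), applying the
Ohm's law expression above to two columns elementwise:
  resistance = [10.0, 22.0, 47.0]   # hypothetical V1 column (ohms)
  voltage    = [5.0, 12.0, 9.0]     # hypothetical V2 column (volts)

  # Apply the expression V2 / V1 to each row to obtain current in amps
  # (I = V / R), placing the results in a new column.
  current = [v / r for r, v in zip(resistance, voltage)]
  print(current)   # [0.5, 0.545..., 0.191...]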
¬<Other Transformations>╪79380 ¬
|Other Transformations|
Please note that logit, probit, angular and cumulative transformations
are listed in this section whereas ranks, sortings and normal scores can be
obtained via the nonparametric methods section. If you request probit, logit
or angular transformation for a set of discrete data then the result for each
data point will represent the transformation of the proportion (p) of the
maximum in the variable which that point comes from. Logit transformation is
defined as LOG(p/(1-p)) and provides a way of linearizing sigmoid distributions.
Probit transformation is defined as 5 + Z(1-p) and also provides a way of
linearizing sigmoid distributions. Angular transformation uses arcsin(√p);
this provides a way of linearizing sigmoid distributions and equalising
variances.
For Logit and Probit transformations indeterminable values (when p=0 or p=1)
are stored as missing data. The name for a variable resulting from one of these
transformations is the name of the source variable suffixed with ~Pr, ~Lo, ~Ag
or ~Cm as appropriate. N.B. - each time something computationally illegal,
such as the natural logarithm of zero, is requested then the result is stored
as the missing data value.
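A minimal Python sketch of these three transformations, using the definitions
above (NormalDist().inv_cdf plays the role of Z; the treatment of p=0 and p=1
as missing data follows the text):
  from math import log, asin, sqrt
  from statistics import NormalDist

  MISSING = None   # stand-in for the Arcus missing data value

  def logit(p):
      return log(p / (1 - p)) if 0 < p < 1 else MISSING

  def probit(p):
      # 5 + Z(1-p), with Z the standard normal quantile function.
      return (5 + NormalDist().inv_cdf(1 - p)) if 0 < p < 1 else MISSING

  def angular(p):
      return asin(sqrt(p))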
|Descriptive Statistics|
This option provides measures of location and dispersion which describe the
data in any variable. You are given the number, arithmetic mean, variance,
standard deviation, standard error of the arithmetic mean, confidence interval
for the arithmetic mean, geometric mean, coefficient of skewness, coefficient
of kurtosis, maximum, upper quartile, median, lower quartile, minimum and range
for each selected variable. You can also choose to calculate any additional
quantile and this is appended to the results listed above. Incalculable results
are displayed as missing data using an asterisk (*). Arcus uses Kendall's
definitions of skewness and kurtosis (ref 7). The relative merits of these
descriptive methods are presented clearly and concisely in Aviva Petrie's
book (ref 1).
¬<reference list>╪310584 ¬
|Pictorial Statistics|
¬<Histogram>╪82408 ¬
¬<Box and Whisker Plot>╪83131 ¬
¬<Scatter Plot>╪83837 ¬
¬<Normal Plot>╪84456 ¬
¬<Survival Plot>╪85093 ¬
¬<Error Bar Plot>╪85626 ¬
¬<Spread Plot>╪86257 ¬
¬<Ladder Plot>╪86870 ¬
You can describe and relate your data graphically using these functions.
Neat scales are chosen automatically for each function and the figure is
composed using standard ASCII text characters or graphics images. High quality
presentation graphics output can be obtained from the graphics functions when
you are using a PostScript printer. Please note that you can also export
PostScript images for use in most good word processing software; however the
target printer must also be PostScript compatible. Printing is activated by
pressing P when the figure is displayed. You can annotate the ASCII graphics
before sending them to a printer or to a log file.
|Histogram (ASCII)|
The frequency distribution histograms are plotted horizontally across the screen
with the count for each division displayed at the right hand side. This function
divides your variable into x ranges between the minimum and maximum value of the
selected variable. You specify x. Arcus then selects a "neat" set of midpoints
for these ranges and draws horizontal bars to represent the number of data in
the variable which fall into each of these ranges. For fewer than 64 data
points per bar each asterisk (*) represents one count; above this value the
bars are proportional representations but their true values can be gleaned
from the counts display at the right hand side of the screen.
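A minimal sketch of such an ASCII histogram in Python (the equal ranges and
one-asterisk-per-count rule follow the description above; the "neat" midpoint
selection performed by Arcus is not reproduced):
  def ascii_histogram(data, x):
      lo, hi = min(data), max(data)
      width = (hi - lo) / x
      counts = [0] * x
      for d in data:
          # Place each datum in one of x equal ranges between min and max.
          i = min(int((d - lo) / width), x - 1)
          counts[i] += 1
      for i, c in enumerate(counts):
          mid = lo + (i + 0.5) * width
          print(f"{mid:8.2f} |{'*' * c} {c}")

  ascii_histogram([1, 2, 2, 3, 3, 3, 4, 7], 3)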
|Box and Whisker Plot|
Box and Whisker plots, described by Tukey (1977), give you a pictorial
representation of the nonparametric descriptive statistics. In Arcus Pro-Stat,
the "box" bounded by parentheses represents the distance between the first and
third quartiles with the median between them marked by an asterisk (*), with the
minimum as the origin of the leading "whisker" and with the maximum as the limit
of the trailing "whisker". This is a very good way of showing an audience the
spread of your data, it is much easier to convey than a dry list of
nonparametric descriptive statistics. The graphics based version of this plot
is intended for PostScript presentation graphics.
|Scatter Plot (ASCII) & (Graphic)|
This function plots a Y axis (ordinate) variable against an X axis (abscissa)
variable. The scale selection for the axes is automatic. Superimposed plot
points are displayed as the number of plot points at one screen location
provided this number is less than 10. If more than 9 plot points lie at one
screen location then it is marked with the letter X. The graphics based version
of this function allows you to display up to four series which are displayed
using different marker styles for each series and you can opt to display
joining lines between the markers.
|Normal Plot (ASCII)|
The normal plot uses the same physical plotting procedures as the ASCII text
based scattergram but you select only one variable which is plotted against
its normal scores. Normal scores are calculated as Z((2k-1)/2n) where k is the
rank of a datum in your variable, n is the number of data and Z is a quantile
from the standard normal distribution. The linearity of the resultant plot
indicates the normality of the distribution of the data in your selected
variable. For a more objective assessment of normality please use the
Shapiro-Wilk W test which is listed in the parametric methods section.
|Survival Plot|
This provides a graphics based step plot for displaying survival curves. It is
intended to be used with variables for Time on the X axis and S (the Kaplan-
Meier survivor function) on the Y axis. You can use up to four series and
high quality output is available via a PostScript printer. This is a good
accompaniment to a presentation of survival analysis which compares survival
(or time to event) data in different groups. Please see ¬Kaplan-Meier╪182964 ¬ for
more information on generating S.
|Error Bar Plot|
The high-low-close plots of business graphics packages can be difficult to
manipulate if you have to display more than one series; therefore, I have
included this function in Arcus. You can use up to four series for which you
must provide three variables for each series; the X data, the Y data and the
error function of the Y data. The error function can be, for example, the
standard error of the mean for each Y when each Y point represents the mean of
repeated observations. Different series are represented by different marker
styles and you can opt to show joining lines between the markers.
|Spread Plot|
This is a very useful way of presenting the spread of data in up to four
groups. It is one step back from the Box & Whisker plot in that it gives
an entirely pictorial representation of the spread of your data. The axis is
divided into an arbitrary number of divisions which are the width of a plot
point; if more than one datum occupies a division it is plotted alongside the
first, thus a concentration of data at a particular value is represented by a
broad band. I liken this to a "statistical electrophoresis". High quality
output is available when using a PostScript printer.
|Ladder Plot|
Arcus provides a ladder plot for the comparison of paired data from two groups.
This is a useful pictorial accompaniment to paired t and Wilcoxon signed ranks
tests when the number of pairs is not too large. Each pair is joined by a line;
these lines would look like the parallel rungs of a ladder if there were little
difference between each pair. A presentation of continuous observations from
a small to medium sized population before and after an intervention is
conveniently represented by a ladder plot. High quality output is available
when using a PostScript printer.
|Parametric Methods|
¬<Tests using Student's t>╪87973 ¬
¬<Z (Normal distribution) tests>╪96200 ¬
¬<F (variance ratio) test>╪95622 ¬
¬<Shapiro-Wilk test for normality>╪97214 ¬
This section provides various hypothesis tests and descriptive functions which
assume that your data come from a normal distribution. The Shapiro-Wilk W test
is, strictly speaking, a nonparametric method but it is included in this section
because it enables you to test for "non-normality".
|Tests using Student's t|
¬<Paired t test>╪88342 ¬
¬<Single sample t test>╪91441 ¬
¬<Unpaired (two sample) t test>╪93086 ¬
Please note that Student t tests calculated directly from numbers, means and
standard deviations, instead of from worksheet columns, are given in the
Student's t distribution section of the instant functions module.
|Paired t test|
The paired t test provides an hypothesis test of the difference between
population means for a pair of random samples whose differences are from an
approximately normal distribution. A confidence interval is provided for the
difference between the means and the limits of agreement are given (ref 4, 5).
EXAMPLE: Comparison of peak expiratory flow rate before and after a walk on a
cold winter's day for a random sample of 9 asthmatics. You enter two columns
in the worksheet, one of PEFR's before the walk and the other of PEFR's after
the walk. In this example each row must represent the same subject, in other
studies the data might be matched / paired in some other way.
subject before after
1 312 300
2 242 201
3 340 232
4 388 312
5 296 220
6 254 256
7 391 328
8 402 330
9 290 231
If you were to plot these pairs using a ladder plot you would see that all but
one pair decreases. You might also wish to test the assumption that the
differences are from a normal distribution; this can be done with the
Shapiro-Wilk test. If you want to create a separate column of differences then
press Alt+Q in the worksheet to create a new column as "after-before".
Then select this function with say 95% confidence level when prompted. The
results screen will show you p values and the confidence interval for the
difference between the means.
For our example:
Mean of differences = 56.1
95% CI for difference between means = 29.8 to 82.4
two tailed p = 0.0012 **
A null hypothesis of no difference between the means is clearly rejected because
the confidence interval does not include zero.
¬<p values>╪29175 ¬
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
Considering other studies where the two groups represent two different ways of
measuring the same thing or two different observers, you might be interested in
the limits of agreement. These limits are displayed on the standard paired t
test results screen and an agreement plot is given after each paired t test.
These only apply to agreement studies. When two methods of measurement are
being compared it is almost always erroneous to present a scatter plot with
correlation as a measure of agreement between the paired data obtained using the
two methods of measurement. Highly correlated results often agree poorly;
indeed, large shifts in measurement scales may leave the correlation coefficient
unaltered. It is therefore necessary to provide a quantification of agreement.
This is done by use of the paired t-test and limits of agreement. Arcus allows
you to select a confidence level for limits of agreement and provides an ASCII
plot of the difference against the mean for each pair of measurements. This
plot also displays the overall mean difference bounded by the limits of
agreement. A good review of this subject has been provided by Martin Bland and
Doug Altman (ref 29, 5).
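As a cross-check of the worked example above, a minimal sketch in Python using
scipy (the confidence interval is built from the t quantile; the limits of
agreement follow the usual Bland and Altman form, mean difference plus or
minus 1.96 standard deviations of the differences, which is assumed here to be
the form Arcus uses):
  from scipy import stats

  before = [312, 242, 340, 388, 296, 254, 391, 402, 290]
  after  = [300, 201, 232, 312, 220, 256, 328, 330, 231]
  diffs  = [b - a for b, a in zip(before, after)]

  n = len(diffs)
  mean = sum(diffs) / n
  sd = (sum((d - mean) ** 2 for d in diffs) / (n - 1)) ** 0.5
  se = sd / n ** 0.5

  t, p = stats.ttest_rel(before, after)   # two tailed p, about 0.0012
  tq = stats.t.ppf(0.975, n - 1)
  print(mean - tq * se, mean + tq * se)   # about 29.8 to 82.4, as above

  z = stats.norm.ppf(0.975)               # limits of agreement
  print(mean - z * sd, mean + z * sd)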
|Single sample t test|
The single sample t method tests the null hypothesis that the population mean
is equal to a specified value. If this value is zero then the confidence
interval for the sample mean is given (ref 4, 5).
EXAMPLE: Consider 20 first year resident doctors drawn at random from a
regional health authority, resting systolic blood pressures measured using an
electronic sphygmomanometer were:
128 127
118 115
144 142
133 140
132 131
111 132
149 122
139 119
136 129
126 128
From previous large studies of "healthy" individuals drawn at random from the
general public (with the same male:female ratio) a resting systolic blood
pressure of 120 mm Hg was predicted as the age matched population mean. To
analyse these data in Arcus first prepare a worksheet column containing all 20.
Then select the single sample t test from the parametric methods menu of the
analysis section. Enter your population mean as 120; then run the test again
without entering a population mean to obtain the confidence interval for the
sample mean.
For our example:
sample mean = 130
95% CI for difference between means (i.e. sample-population) = 5.4 to 14.7
95% CI for sample mean = 125.4 to 134.7
two tailed p = 0.0002 ***
A null hypothesis of no difference between sample and population means has
clearly been rejected. Using the 95% CI we expect the mean systolic BP for
this population of doctors to be at least 5 mm Hg greater than the age and
sex matched general public, lying somewhere between 125 and 135 mm Hg.
¬<p values>╪29175 ¬
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
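A brief scipy sketch of this test, using the data above (illustrative only):
  from scipy import stats

  bp = [128, 118, 144, 133, 132, 111, 149, 139, 136, 126,
        127, 115, 142, 140, 131, 132, 122, 119, 129, 128]
  t, p = stats.ttest_1samp(bp, popmean=120)
  print(p)   # two tailed p of about 0.0002, as quoted above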
|Unpaired (two sample) t test|
The unpaired t method tests the null hypothesis that the population means
relating to two independent, random samples from an approximately normal
distribution are equal (ref 4, 5). A confidence interval is constructed for
the difference between population means. This test must not be used if there
is a significant difference between the variances of the two samples; this is
tested for and you are given appropriate warnings. There are parametric
alternatives which have been designed to cope with the situation of unequal
variances, namely the methods due to Behrens and Welch, but the nonparametric
Mann-Whitney test is more robust.
EXAMPLE (from Armitage, ref 4 p 109): Consider the gain in weight of 19 female
rats between 28 and 84 days after birth. 12 were fed on a high protein diet
and 7 on a low protein diet:
High Protein Low Protein
134 70
146 118
104 101
119 85
124 107
161 132
107 94
83
113
129
97
123
To analyse these data in Arcus first prepare them in two worksheet columns and
label these columns appropriately. Then select the unpaired t test from the
parametric methods menu of the analysis section. Request a 95% confidence
interval (CI) by pressing the enter key when prompted.
For this example:
mean of "High Protein" = 120 g
mean of "Low Protein" = 101 g
difference between sample means = 19
95% CI for difference between population means = -2.2 to 40.2
two tailed p = 0.07
Thus we have a difference which is not quite significant at the 5% level. The
most important information is, however, conveyed by the CI. The 95% CI includes
zero, therefore we cannot be confident (at the 95% level) that these data show
any difference in weight gain. As most of the interval is toward weight gain
and as the test result is in the grey "suggestive" 5%-10% zone we have good
evidence for repeating this experiment with larger numbers. Bigger samples
will probably shrink the range of uncertainty so that the CI contracts to a
narrower band clearly above zero.
NB We did not consider a one tailed p here because we could not be absolutely
certain that the rats would all benefit from a high protein diet in comparison
with those on a low protein diet. They might have suffered adverse effects
from our high protein diet.
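A brief scipy sketch of this test with the data above (illustrative only; the
equal variance form of the test is used, matching the assumption in the text):
  from scipy import stats

  high = [134, 146, 104, 119, 124, 161, 107, 83, 113, 129, 97, 123]
  low  = [70, 118, 101, 85, 107, 132, 94]
  t, p = stats.ttest_ind(high, low)
  print(p)   # two tailed p of about 0.07, as quoted above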
¬<p values>╪29175 ¬
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|F (variance ratio) Test|
This tests the equality of two variances from random samples which are
approximately normally distributed. Only the upper tail probability need be
considered because the larger variance is always used as the numerator in
Snedecor's variance ratio F (ref 4, 5). In most situations this probability
should be doubled to give a two tailed test. Analysis of variance can utilise
a one tailed probability because the numerator and denominator of the variance
ratio are predetermined.
¬<p values>╪29175 ¬
¬<reference list>╪310584 ¬
|Z (normal distribution) Test|
For large (n >= 50) normally distributed samples you can use this sensitive
method which is equivalent to the single sample and unpaired t tests. You may
either compare two independent random variables or compare the data in a variable
with a known population mean. Remember that with large degrees of freedom a t
distribution is approximately normal (ref 4, 5).
EXAMPLE: See the examples for t tests and consider these in the context of
larger samples.
You will gain a little more sensitivity by using the normal distribution tests
but you must have good reason to believe that your data have been drawn from a
normal distribution. The t tests are less sensitive to small deviations from
normality, so use them instead if you have any doubt. If your data are clearly
non-normal then you must use one of the nonparametric methods even if you have
large samples.
¬<p values>╪29175 ¬
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|Shapiro-Wilk test for non-normality|
This test is a complex analysis of variance which can be used to test a variable
for the non-normality of its data. There must be a random sample of between 3
and 2000 data. The null hypothesis of the test is that the sample is taken from
a normal distribution, thus a p value of less than 0.05 rejects this
supposition of normality. You should not use any of the parametric methods
with variables for which W is significant. Most authors agree that this is the
most reliable quantification of normality for small to medium sample sizes
(ref 6, 21, A17, A18).
EXAMPLE (Shapiro & Wilk ref 21): Consider the following 30 penicillin yields:
0.0958 0.0002
0.0333 -0.0026
0.0293 -0.0036
0.0246 -0.0042
0.0206 -0.0113
0.0194 -0.0139
0.0191 -0.0211
0.0182 -0.0333
0.0173 -0.0341
0.0132 -0.0363
0.0102 -0.0363
0.0084 -0.0402
0.0077 -0.0582
0.0058 -0.1184
0.0016 -0.1398
To test these data for non-normality using Arcus you must first prepare them in
a worksheet column. Then select the Shapiro-Wilk test from the parametric
methods menu of the analysis section.
Here the test statistic was clearly significant at p = 0.002 which rejects the
null hypothesis that these data are from a normal distribution. In fact these
data were from a 2 by 5 factor grouping experiment.
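A brief scipy sketch with the same data (illustrative only; small differences
from the quoted p value may arise from differences in the W approximation):
  from scipy import stats

  yields = [0.0958, 0.0333, 0.0293, 0.0246, 0.0206, 0.0194, 0.0191, 0.0182,
            0.0173, 0.0132, 0.0102, 0.0084, 0.0077, 0.0058, 0.0016, 0.0002,
            -0.0026, -0.0036, -0.0042, -0.0113, -0.0139, -0.0211, -0.0333,
            -0.0341, -0.0363, -0.0363, -0.0402, -0.0582, -0.1184, -0.1398]
  w, p = stats.shapiro(yields)
  print(w, p)   # p should be close to the 0.002 quoted above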
N.B. Do NOT use this test to say that your data are "normally distributed";
this is quite wrong! The Shapiro-Wilk test provides evidence for certain
types of "non-normality"; it does NOT guarantee "normality".
¬<p values>╪29175 ¬
¬<reference list>╪310584 ¬
|Nonparametric Methods|
¬<Mann-Whitney test>╪100302 ¬
¬<Wilcoxon's signed ranks test>╪102924 ¬
¬<Spearman's rank correlation>╪105582 ¬
¬<Kendall's rank correlation>╪107776 ¬
¬<Cuzick's test for trend>╪110163 ¬
¬<Two sample Smirnov test>╪112415 ¬
¬<Quantile confidence interval>╪114119 ¬
¬<Save ranked data>╪115938 ¬
¬<Save sorted data>╪117024 ¬
¬<Save normal scores>╪118837 ¬
This section provides various hypothesis tests and descriptive functions which
do not assume that your data are taken from normal distributions. When you
have few data or there is doubt about their distribution then you should err on
the side of caution and use nonparametric methods. These methods are usually
less sensitive than their parametric counterparts but they are more robust. The
numerical methods involved in these rank based calculations have progressed in
the last few years and Arcus Pro-Stat utilises the most modern developments,
including some calculations of exact probability in the presence of tied data.
An excellent account of nonparametric methods is given by Conover (ref 6).
In addition to the rank based tests below you can use three functions in this
section to save the ranks, sorted data or normal scores of a variable into a
new variable. The name of this new variable is the name of the source variable
prefixed with Rk~, Sr~ or Ns~ as appropriate.
|Mann-Whitney test| / Wilcoxon Rank Sum Test
This is a distribution free method for the comparison of two independent random
samples which have been measured using a scale that is at least ordinal. Arcus
uses the sampling distribution of U to give exact probabilities. This can take
a long time when there are tied data so please do not think that your computer
has crashed. Confidence intervals are constructed for the difference between
the two population means. The level of confidence used is as close as possible
to that which you have selected. Arcus approaches the selected confidence level
from the conservative side. When samples are large a normal approximation is
used for the hypothesis test and for the confidence interval (ref 6, A6, A19,
A20).
EXAMPLE: (from Conover ref 6 p 218) The following data represent fitness scores
from two groups of boys of the same age, those from homes in the town and those
from farm homes:
Farm Boys Town Boys
14.8 10.6 12.7 16.9 7.6 2.4 6.2 9.9
7.3 12.5 14.2 7.9 11.3 6.4 6.1 10.6
5.6 12.9 12.6 16.0 8.3 9.1 15.3 14.8
6.3 16.1 2.1 10.6 6.7 6.7 10.6 5.0
9.0 11.4 17.7 5.6 3.6 18.6 1.8 2.6
4.2 2.7 11.8 5.6 1.0 3.2 5.9 4.0
To analyse these data in Arcus you must first enter them in two separate
worksheet columns. Then select the Mann-Whitney test from the nonparametric
methods menu of the analysis section. Press enter when prompted for confidence
interval specifications, this accepts the default 95% level.
For this example:
difference between sample medians = 0.8
two tailed p = 0.53
95.1% CI for difference between population means = -2.4 to 4.4
Here we have assumed that these groups are independent and that they represent
at least hypothetical random samples of their sub-populations. In
this analysis we clearly have to accept the null hypothesis that one group does
NOT tend to yield different fitness scores to the other. The extent of this
lack of difference is shown by zero being contained well within the confidence
interval for the difference between population means. Note that the quoted
95.1% confidence interval is as close as you can get to 95% because of the very
nature of the mathematics involved in nonparametric methods like this.
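A brief scipy sketch with the same data (illustrative only; the ties force an
approximation, so the p value may differ slightly from the exact 0.53 above):
  from scipy import stats

  farm = [14.8, 10.6, 12.7, 16.9, 7.3, 12.5, 14.2, 7.9, 5.6, 12.9, 12.6,
          16.0, 6.3, 16.1, 2.1, 10.6, 9.0, 11.4, 17.7, 5.6, 4.2, 2.7,
          11.8, 5.6]
  town = [7.6, 2.4, 6.2, 9.9, 11.3, 6.4, 6.1, 10.6, 8.3, 9.1, 15.3, 14.8,
          6.7, 6.7, 10.6, 5.0, 3.6, 18.6, 1.8, 2.6, 1.0, 3.2, 5.9, 4.0]
  u, p = stats.mannwhitneyu(farm, town, alternative="two-sided")
  print(p)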
¬<p values>╪29175 ¬
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|Wilcoxon's Signed Ranks| (matched pairs) test
This is a nonparametric method for the comparison of a pair of samples whose
component data have differences which are from a symmetrical distribution.
A two tailed test uses the null hypothesis that the common median of the
differences is zero. A confidence interval is constructed for the difference
between the population medians. The sum of the ranks for the positive
non-zero differences is given and the exact permutational probability
associated with this value is calculated for sample sizes of less than 30.
A normal approximation is used with sample sizes of 30 or more and when there
are ties. Please note that some statistical software uses an old approximation
formula which is inappropriate in the presence of ties. Conover (ref 6) states
that in the presence of ties the test statistic must be the sum of the signed
ranks divided by the square root of the sum of their squares. You may be
familiar with the old method
of using the smaller sum of ranks in one direction but this is not appropriate
with tied data. Confidence limits are calculated using critical values for k
with sample sizes up to 30 or by calculating K* for samples with more than 30
observations (ref 6, A20).
EXAMPLE (from Conover ref 6 p 283): The following data represent aggressivity
scores for 12 pairs of monozygotic twins:
Firstborn: 86 71 77 68 91 72 77 91 70 71 88 87
Second Twin: 88 77 76 64 96 72 65 90 65 80 81 72
To analyse these data in Arcus you must first enter them into two columns in the
worksheet. Then select Wilcoxon's signed ranks test from the nonparametric
methods menu of the analysis section. Select a 95% confidence interval by
pressing enter when prompted by the confidence interval menu.
For this example:
two tailed p = 0.45
median difference = 1.5
95.8% CI for the difference between population medians = -2.5 to 6.5
Assuming that the paired differences come from a symmetrical distribution then
these results show that one group did not tend to yield different results to
the other group which was paired with it, i.e. there was no statistically
significant difference between the aggressivity scores of the firstborn as
compared with the second twin. The extent of this lack of difference is shown
well by the confidence interval which clearly encompasses zero. Note that the
quoted 95.8% confidence interval is as close as you can get to 95% because of
the very nature of the mathematics involved in nonparametric methods like this.
¬<p values>╪29175 ¬
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|Spearman's Rank Correlation|
This is a distribution free test of independence between two variables. It is,
however, insensitive to some types of dependence. Kendall's tau gives a much
better measure of correlation and is also a better test for independence in the
two tailed setting. Spearman's rank correlation coefficient (rho) is given to
six decimal places. The probability associated with rho is evaluated using a
recurrence method when n < 7 and the Edgeworth series expansion when n >= 7
(ref A13). A confidence interval for rho is constructed using Fisher's Z
transformation (ref 6, 11, 15).
EXAMPLE (from Conover ref 6 p 283): The following data represent aggressivity
scores for 12 pairs of monozygotic twins:
Firstborn: 86 71 77 68 91 72 77 91 70 71 88 87
Second Twin: 88 77 76 64 96 72 65 90 65 80 81 72
To analyse these data in Arcus you must first enter them into two columns in the
worksheet. Then select Spearman's rank correlation from the nonparametric
methods menu of the analysis section. Select a 95% confidence interval by
pressing enter when prompted by the confidence interval menu.
For this example:
rho = 0.74
95% CI for rho = 0.28 to 0.92
two tailed p = 0.0082 **
Here we have clearly rejected the null hypothesis of mutual independence
between the agressivity scores of pairs of twins. With a two tailed test we
are considering the possibility of a positive or a negative correlation, i.e.
we can't be sure of this direction at the outset. A one tailed test would have
been restricted to correlation in one direction only i.e. big values of one
group associated with big values of the other (positive correlation) or big
values of one group associated with small values of the other (negative
correlation). In our example we can conclude that there is a statistically
significant lack of independence between aggressivity scores of these twins.
We could then go on to speculate that aggressivity had an inherited component,
especially if these twins were brought up by different families.
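A sketch of this analysis in Python, with the Fisher Z confidence interval for
rho built as described above (a variance of 1/(n-3) is assumed for the
transformed coefficient; the p value from scipy may differ slightly from the
Edgeworth based figure quoted above):
  from math import atanh, tanh, sqrt
  from scipy import stats

  first  = [86, 71, 77, 68, 91, 72, 77, 91, 70, 71, 88, 87]
  second = [88, 77, 76, 64, 96, 72, 65, 90, 65, 80, 81, 72]

  rho, p = stats.spearmanr(first, second)
  half = stats.norm.ppf(0.975) / sqrt(len(first) - 3)
  lo, hi = tanh(atanh(rho) - half), tanh(atanh(rho) + half)
  print(rho, lo, hi)   # rho about 0.74, CI about 0.28 to 0.92, as above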
¬<p values>╪29175 ¬
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|Kendall's Rank Correlation|
Spearman's rank correlation is satisfactory for testing a null hypothesis of
independence between two variables but it is difficult to interpret when the
null hypothesis is rejected. Kendall's rank correlation improves upon this by
reflecting the strength of the dependence between the variables being compared.
Arcus gives you the directional change statistics and the test statistic tau.
In the presence of ties the test statistic tau b is given (as Kendall 1970).
A normalised statistic (Z) is also given (continuity corrected and uncorrected)
with associated probability and this is adjusted, using the full variance
formula, in the presence of ties. In the absence of ties the probability
associated with S (and thus tau) is evaluated using a recurrence formula when
n < 9 and the Edgeworth series expansion when n >= 9 (ref A14). In the presence
of ties you must accept the normal approximation (ref 6, 15).
EXAMPLE (from Conover ref 6 p 283): The following data represent aggressivity
scores for 12 pairs of monozygotic twins:
Firstborn: 86 71 77 68 91 72 77 91 70 71 88 87
Second Twin: 88 77 76 64 96 72 65 90 65 80 81 72
To analyse these data in Arcus you must first enter them into two columns in the
worksheet. Then select Kendall's rank correlation from the nonparametric
methods menu of the analysis section.
For this example:
tau = 0.56
continuity corrected two tailed p = 0.0136 *
Here we have clearly rejected the null hypothesis of mutual independence
between the agressivity scores of pairs of twins. With a two tailed test we
are considering the possibility of a positive or a negative correlation, i.e.
we can't be sure of this direction at the outset. A one tailed test would have
been restricted to correlation in one direction only i.e. big values of one
group associated with big values of the other (positive correlation) or big
values of one group associated with small values of the other (negative
correlation). In our example we can conclude that there is a statistically
significant lack of independence between aggressivity scores of these twins.
We could then go on to speculate that aggressivity had an inherited component,
especially if these twins were brought up by different families.
¬<p values>╪29175 ¬
¬<reference list>╪310584 ¬
|Cuzick's Test for Trend|
This provides a Wilcoxon-type test for trend across a group of three or more
independent randomly sampled variables. The component data must be at least
ordinal and the groups must be selected in a meaningful, ordered sequence. A
logistic distribution is assumed for errors. If you do not choose to enter your
own group scores then scores are allocated uniformly (1 ... n) in order of
selection of the n groups. For the null hypothesis of no trend across the
groups T will have mean ET, variance VarT and the null hypothesis is tested
using the normalised test statistic Z. Probabilities for Z are derived from
the standard normal distribution. Please note that this test is more powerful
than the application of the Wilcoxon rank-sum / Mann-Whitney test between
more than two groups of data (ref 28).
EXAMPLE (from Cuzick's paper ref 28): Mice were inoculated with cell lines,
CMT 64 to 181, which had been selected for their increasing metastatic
potential. The number of lung metastases found in each mouse after inoculation
are quoted below:
CMT 64 0, 0, 1, 1, 2, 2, 4, 9
CMT 167 0, 0, 5, 7, 8, 11, 13, 23, 25, 97
CMT 170 2, 3, 6, 9, 10, 11, 11, 12, 21
CMT 175 0, 3, 5, 6, 10, 19, 56, 100, 132
CMT 181 2, 4, 6, 6, 6, 7, 18, 39, 60
To analyse these data in Arcus you must first enter them in five worksheet
columns labelled appropriately. Then select Cuzick's test for trend from the
nonparametric methods menu of the analysis section. Just press N when you
are asked if you want to enter group scores; entering your own scores is
unnecessary for most analyses provided you select the variables in the order
you are studying them. With automatic group scoring you must be careful to
select the variables in the order across which you want to look for trend.
For this example:
one tailed p (corrected for ties) = 0.017 *
With these data we started out expecting a trend in one direction only,
therefore we can use a one tailed test for trend. We have shown a statistically
significant trend of increasing numbers of metastases across these malignant
cell lines in this order.
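A sketch of the statistic in Python, following the definitions above (T, ET
and VarT as in Cuzick's paper, ref 28; no tie correction is applied here, so
the p value will differ slightly from the tie corrected 0.017 quoted above):
  from statistics import NormalDist

  groups = [
      [0, 0, 1, 1, 2, 2, 4, 9],                # CMT 64
      [0, 0, 5, 7, 8, 11, 13, 23, 25, 97],     # CMT 167
      [2, 3, 6, 9, 10, 11, 11, 12, 21],        # CMT 170
      [0, 3, 5, 6, 10, 19, 56, 100, 132],      # CMT 175
      [2, 4, 6, 6, 6, 7, 18, 39, 60],          # CMT 181
  ]
  scores = range(1, len(groups) + 1)           # uniform scores 1 ... n

  pooled = sorted(d for g in groups for d in g)
  def midrank(x):                              # midranks for tied data
      lo = pooled.index(x)
      hi = len(pooled) - pooled[::-1].index(x)
      return (lo + 1 + hi) / 2

  N = len(pooled)
  T = sum(l * sum(midrank(d) for d in g) for l, g in zip(scores, groups))
  L = sum(l * len(g) for l, g in zip(scores, groups))
  ET = L * (N + 1) / 2
  VarT = (N + 1) / 12 * (N * sum(l * l * len(g)
         for l, g in zip(scores, groups)) - L * L)
  Z = (T - ET) / VarT ** 0.5
  print(1 - NormalDist().cdf(Z))               # one tailed p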
¬<p values>╪29175 ¬
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|Two Sample Smirnov Test|
Where you have two independent samples which have been drawn from possibly
different populations then you might consider looking for differences between
them using a t test or Mann-Whitney test. These tests are sensitive to
differences between two means or medians but do not consider other differences
such as variance. The two sample Smirnov method tests the null hypothesis that
the distribution functions of the populations from which your samples have been
drawn are identical. The test assumes that you have random samples which are
mutually independent. The measurement scale must be at least ordinal but for
an exact test you should use continuous data.
EXAMPLE (from Conover ref 6 p 370):
X: 7.6 8.4 8.6 8.7 9.3 9.9 10.1 10.6 11.2
Y: 5.2 5.7 5.9 6.5 6.8 8.2 9.1 9.8 10.8 11.3 11.5 12.3 12.5 13.4 14.6
To analyse these data in Arcus you must first enter them into two worksheet
columns and label them appropriately. Then select the two sample Smirnov test
from the nonparametric methods section of the analysis section.
For this example:
two sided p = 0.26
Thus we can not reject the null hypothesis that the two populations from which
our samples were drawn have the same distribution function.
If we were interested in a one sided test then we would need good reason for
expecting one group to yield values above (distribution shifted to the right of)
or below (distribution shifted to the left of) the other group. For these data
neither of the one tailed tests reached significance.
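A brief scipy sketch with the same data (illustrative only; the Smirnov test
is the two sample Kolmogorov-Smirnov test):
  from scipy import stats

  x = [7.6, 8.4, 8.6, 8.7, 9.3, 9.9, 10.1, 10.6, 11.2]
  y = [5.2, 5.7, 5.9, 6.5, 6.8, 8.2, 9.1, 9.8, 10.8,
       11.3, 11.5, 12.3, 12.5, 13.4, 14.6]
  d, p = stats.ks_2samp(x, y)
  print(d, p)   # two sided p should be close to the 0.26 quoted above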
¬<p values>╪29175 ¬
¬<reference list>╪310584 ¬
|Quantile Confidence Intervals|
This selection from the nonparametric methods menu provides a confidence
interval for any quantile. As with all nonparametric confidence intervals, the
exact confidence level is not always attainable but the level which is exact
to the interval constructed is displayed (ref 6,11). Arcus approaches the
confidence interval from the conservative side, i.e. if the nearest levels to
95% are 94.4% and 95.9% then the latter will be chosen. For sample sizes
greater than 30 a reliable approximation based on the central limit theorem is
used (ref 6). A presentation of medians and their confidence intervals is often
more meaningful than the time honoured (abused) tradition of presenting means
and standard deviations. A box and whisker plot is a useful accompaniment to
this function.
EXAMPLE (from Conover ref 6 p 113): The following represent times to failure
in hours for a set of pentode radio valves:
46.9 56.8 63.3 67.1
47.2 59.2 63.4 67.7
49.1 59.9 63.7 73.3
56.5 63.2 64.1 78.5
To analyse these data in Arcus you must first enter them into a worksheet
column and label it appropriately. Then select the quantile confidence interval
from the nonparametric methods section of the analysis section. For a 90%
confidence interval select the 90% button from the confidence interval menu.
Then enter 0.75 to specify that the quantile you want is the upper quartile or
75th percentile.
For this example:
upper quartile = 66.35
90% confidence interval = 63.3 to 73.3
exact confidence level = 90.94%
We may conclude that with 91% confidence the population value of the upper
quartile lies between 63.3 and 73.3 hours.
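A sketch of the underlying order statistic argument in Python (the
conservative search over binomial coverage follows the description above):
  from scipy.stats import binom

  data = sorted([46.9, 56.8, 63.3, 67.1, 47.2, 59.2, 63.4, 67.7,
                 49.1, 59.9, 63.7, 73.3, 56.5, 63.2, 64.1, 78.5])
  n, q, wanted = len(data), 0.75, 0.90

  # Find the closest pair of order statistics (r, s) whose exact coverage,
  # P(X(r) <= population quantile <= X(s)), is at least 90%.
  best = None
  for r in range(1, n):
      for s in range(r + 1, n + 1):
          cover = binom.cdf(s - 1, n, q) - binom.cdf(r - 1, n, q)
          if cover >= wanted and (best is None or s - r < best[0]):
              best = (s - r, r, s, cover)

  _, r, s, cover = best
  print(data[r - 1], data[s - 1], cover)   # 63.3, 73.3, about 0.9094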
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|Save Ranked Data|
This function enables you to save the ranks of a worksheet variable into a new
variable. The name of this new variable is the name of the source variable
prefixed with Rk~. You can choose to calculate a correction factor for ties in
the ranking. Four formulae are offered for tie correction:
1. Σ(t³ - t) / 12
2. Σ(t * (t-1)) / 2
3. Σ(t * (t-1) * (2t+5))
4. Σ(t * (t-1) * (t-2))
...where t is the number of data tied at each tie and upper case sigma (Σ)
is the summation across these ties.
EXAMPLE: Ranking the following aggressivity scores for a sample of firstborn
twins gives:
First Born -----> Rk~First Born (Ranks)
86 8
┌─71 3.5
│ 77──────┐ 6.5
│ 68 │ 1
│ 91─┐ ├tie 11.5
tie┤ 72 ├tie │ 5
│ 77─│────┘ 6.5
│ 91─┘ 11.5
│ 70 2
└─71 3.5
88 10
87 9
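A sketch of midranking and the first tie correction above, in Python
(illustrative, not Arcus code):
  from collections import Counter

  def midranks(data):
      # Tied values share the mean of the ranks they would otherwise occupy.
      s = sorted(data)
      return [(s.index(x) + 1 + len(s) - s[::-1].index(x)) / 2 for x in data]

  first_born = [86, 71, 77, 68, 91, 72, 77, 91, 70, 71, 88, 87]
  print(midranks(first_born))
  # [8, 3.5, 6.5, 1, 11.5, 5, 6.5, 11.5, 2, 3.5, 10, 9]

  # Tie correction formula 1: Σ(t³ - t) / 12 over each group of t ties.
  tie1 = sum((t ** 3 - t) / 12 for t in Counter(first_born).values())
  print(tie1)   # three groups of 2 ties gives 3 * (8 - 2) / 12 = 1.5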
|Save Sorted Data|
This function enables you to save the data of a worksheet variable into a new
variable in a sorted form. The name of this new variable is the name of the
source variable prefixed with Sr~. Sorting may be ascending or descending.
The sort may also be tied to the data of another variable, i.e. the data in
variable b may be sorted in the order of sorting of variable a. This paired
sorting can be repeated for any number of columns.
EXAMPLE: Sorting the following aggressivity scores for a sample of firstborn
twins in ascending order gives:
First Born -----> Sr~First Born (Sorted)
86 68
71 70
77 71
68 71
91 72
72 77
77 77
91 86
70 87
71 88
88 91
87 91
EXAMPLE 2: Sorting the following aggressivity scores for a sample of second
born twins by the ascending order of the scores for firstborn twins gives:
First Born Second Born -----> Sr~Second Born~First Born
86 88 64
71 77 65
77 76 80
68 64 77
91 96 72
72 72 76
77 65 65
91 90 88
70 65 72
71 80 81
88 81 96
87 72 90
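The paired sort corresponds to the following Python idiom (the order within
ties on the firstborn scores may differ from the listing above):
  first  = [86, 71, 77, 68, 91, 72, 77, 91, 70, 71, 88, 87]
  second = [88, 77, 76, 64, 96, 72, 65, 90, 65, 80, 81, 72]

  # Sort the second born scores by ascending order of the firstborn scores.
  paired = [s for _, s in sorted(zip(first, second))]
  print(paired)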
|Save Normal Scores|
This function enables you to save the normal scores of a worksheet variable
into a new variable. The name of this new variable is the name of the source
variable prefixed with Ns~. Normal scores are defined here as Z((2k-1)/2n)
where k is the rank, n is the sample size and Z is a standard normal deviate.
EXAMPLE: Scoring the following aggressivity scores for a sample of firstborn
twins using the normal score formula above gives:
First Born -----> Ns~First Born (normal scores)
86 0.3186
71 -0.6745
77 0
68 -1.7317
91 1.3830
72 -0.3186
77 0
91 1.3830
70 -1.1503
71 -0.6745
88 0.8122
87 0.5485
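A sketch of the formula above in Python (midranks are used for ties, which
matches the equal scores given to the tied 71s and 77s in the listing):
  from statistics import NormalDist

  def normal_scores(data):
      s, n = sorted(data), len(data)
      def midrank(x):
          return (s.index(x) + 1 + n - s[::-1].index(x)) / 2
      # Z((2k - 1) / 2n), with k the (mid)rank and Z the normal quantile.
      return [NormalDist().inv_cdf((2 * midrank(x) - 1) / (2 * n))
              for x in data]

  print(normal_scores([86, 71, 77, 68, 91, 72, 77, 91, 70, 71, 88, 87]))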
|Regression and Correlation|
This section provides various regression and correlation analyses. Please note
that Kendall's and Spearman's correlations are provided in the nonparametric
methods section.
¬<Simple linear>╪124799 ¬
¬<Multiple linear>╪129024 ¬
¬<Regression in Groups>╪135745 ¬
¬<Polynomial>╪144338 ¬
¬<Linearized>╪148165 ¬
¬<Probit Analysis>╪149788 ¬
¬<Non-Linear Models>╪156043 ¬
REGRESSION
~~~~~~~~~~
Regression is a way of describing how one variable, the so called dependent
variable, is numerically related to other, so called predictor variables.
The dependent variable is also referred to as Y and is plotted on the vertical
axis (ordinate) of a graph. The predictor variable(s) is(are) also referred
to as X, independent, prognostic or explanatory variables. The horizontal
axis (abscissa) of a graph is used for plotting X. Predictors are variables
which we must be able to measure without error and we must have reason to
assume that the errors associated with measuring Y are randomly distributed.
All of the conclusions that we draw from regression depend upon the truth of
these assumptions about error. The commonest assumption is that the errors in
Y are from a random normal distribution. If this assumption is reasonable
and we suspect that the changes in Y are proportional to the changes in X then
we can try linear regression:
Y (% Growth 70-100 days) │ *
│ * * *
│
│ *
│ *
│ *
│ * *
│ * *
│
│ * *
└───────────────────────────
X (Birth Weight)
Looking at the data like this is a vital first step. From the graph we
suspect that low birth weight babies grow faster in the 70-100 days
interval than their higher birth weight counterparts. You could almost
draw a straight line through the points, therefore, assuming growth between
70 and 100 days is from a normal distribution we can try to fit a straight
line equation using simple linear regression on these data:
Equation: Y = A + BX
B is the gradient, slope or regression coefficient.
A is the intercept of the line at Y axis or regression constant.
The equation describes the best relationship between the POPULATION values of
X and Y which can be found using this method. When you have obtained this
equation it can be used for prediction and various hypothesis tests.
N.B. Always think of the biological relevance of this equation, i.e. in our
example we must not get carried away with the idea that the growth of a baby
between 70 and 100 days after birth is a simple linear function of their birth
weight as there are many other variables affecting the babies' growth. We
could gather more information to feed into a complex multiple regression
but it is very unlikely that we could satisfy all of the above assumptions.
For these reasons data which are not drawn from highly controlled isolated
experiments must be treated with caution.
MATHS: The basic method used to find the regression equation is called least
squares. This minimises the sum of the squares of the errors associated with
each Y point by differentiation. This error is the difference between the
observed Y point and the Y point predicted by the regression equation. In
linear regression this error is also the error term of the Y distribution, the
residual error.
ASSUMPTIONS: X observed without error
Y drawn at random from a normal distribution for each X
True mean of Y distribution for each X lies on regression line
All Y distributions have same variance (this is homoscedasticity)
Y error is independent of X
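The least squares estimates described in the MATHS note above can be written
down directly: B = sum((x - mean x)(y - mean y)) / sum((x - mean x)²) and
A = mean y - B * mean x. A minimal sketch in Python (assuming numpy; the data
here are hypothetical, purely for illustration):

  import numpy as np

  x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical predictor
  y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])  # hypothetical response

  B = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
  A = y.mean() - B * x.mean()
  residuals = y - (A + B * x)   # least squares minimises (residuals ** 2).sum()
  print(A, B)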
CORRELATION
~~~~~~~~~~~
This refers to the interdependence or co-relationship of variables. In the
context of our example it looks at the closeness of the linear relationship
between X and Y. A measure of this is given by Pearson's product moment
correlation coefficient rho. Rho is called R when it has been estimated
from a regression on sample data. R lies between -1 and 1 with 0 for no
linear correlation, 1 for perfect positive (slope up) linear correlation and
-1 for perfect negative (slope down) linear correlation.
N.B. If R is close to ± 1 then this does NOT mean that there is a good causal
relationship between X and Y. It just shows that the sample data is close
to a straight line. R is a much abused statistic!
MATHS: R squared is the proportion of the total variance of Y that can be
explained by the linear regression of Y on X. 1-R² is the proportion that is
not explained by the regression. Thus 1-R² = S²Y.X / S²Y, where S²Y.X is the
variance of Y about the regression line (the residual variance).
|Simple Linear Regression|
This provides simple linear regression (Y = A + BX) by the least squares method.
It is assumed that for each of the X values the corresponding Y values have
been drawn at random from a normal distribution. Summary statistics are given
in full as a springboard for further analysis. Pearson's product moment
correlation coefficient (r) is given as a measure of association between the
two variables. Confidence limits are constructed for the correlation
coefficient using Fisher's Z transformation. The null hypothesis that r = 0
(i.e. no association) is evaluated using a modified t test (ref 4, 5). The
estimated regression line may be plotted and belts representing the standard
error and confidence interval for the population value of the slope can be
displayed. These belts represent the reliability of the regression estimate,
the tighter the belt the more reliable the estimate (ref 11).
NB If you require a weighted linear regression then please use the multiple
linear regression function in Arcus; it will allow you to use just one
predictor variable, i.e. the simple linear regression situation. Note also
that the multiple regression option will allow you to select regression
without an intercept, i.e. forced through the origin.
EXAMPLE (from Armitage ref 4 p 148): The following data represent birth
weights of babies and their percentage increase between 70 and 100 days after
birth:
X (birth weight oz) Y (increase in weight 70-100 days as % of X)
72 68
112 63
111 66
107 72
119 52
92 75
126 76
80 118
81 120
84 114
115 29
118 42
128 48
128 50
123 69
116 59
125 27
126 60
122 71
126 88
127 63
86 88
142 53
132 50
87 111
123 59
133 76
103 72
106 90
118 68
114 93
94 91
To analyse these data in Arcus you must first enter them into two columns in
the worksheet appropriately labelled. Then select simple linear regression
from the regression and correlation menu of the analysis section. Press enter
when you are prompted for a confidence interval, this will select the default
95% level.
For this example:
Y = -0.8643X + 167.8701
95% CI for slope = -1.2231 to -0.5055
r square = 0.4465
F for regression = 24.2 (p < 0.0001)
r = -0.6682
95% CI for r = -0.8248 to -0.4166
two tailed p (for r = 0) < 0.0001
From this analysis we have gained the equation for a straight line forced
through our data i.e. % increase in weight = 167.87 - 0.864 * birth weight.
The r square value tells us that about 45% of the total variation about the
Y mean is explained by the regression line. The analysis of variance test for
the regression, summarised by the ratio F, shows that the regression itself was
statistically highly significant. This is equivalent to a t test with the null
hypothesis that the slope is equal to zero. The confidence interval for the
slope shows that with 95% confidence the population value for the slope lies
somewhere between -0.5 and -1.2. The correlation coefficient r was
statistically highly significantly different from zero. Its negative value
indicates that there is an inverse relationship between X and Y i.e. lower
birth weight babies show greater % increases in weight at 70 to 100 days after
birth. With 95% confidence the population value for r lies somewhere between
-0.4 and -0.8.
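These figures can be cross-checked outside Arcus; a minimal sketch in Python
(assuming numpy and scipy, for illustration only):

  import numpy as np
  from scipy import stats

  x = np.array([72, 112, 111, 107, 119, 92, 126, 80, 81, 84, 115, 118, 128,
                128, 123, 116, 125, 126, 122, 126, 127, 86, 142, 132, 87,
                123, 133, 103, 106, 118, 114, 94], dtype=float)
  y = np.array([68, 63, 66, 72, 52, 75, 76, 118, 120, 114, 29, 42, 48, 50,
                69, 59, 27, 60, 71, 88, 63, 88, 53, 50, 111, 59, 76, 72,
                90, 68, 93, 91], dtype=float)

  res = stats.linregress(x, y)
  print(res.slope, res.intercept)     # about -0.8643 and 167.8701
  print(res.rvalue, res.rvalue ** 2)  # about -0.6682 and 0.4465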
¬<regression and correlation>╪119789 ¬
¬<p values>╪29175 ¬
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|Multiple Linear Regression|
If you need to study the effect of simultaneous changes in several independent
variables (e.g. creatinine clearance and mean systolic blood pressure) upon one
dependent variable (e.g. post-anaesthetic recovery time) then you might find
multiple linear regression useful. Arcus uses singular value decomposition to
solve the linear equations; this is a robust method which optimises accuracy
and is not defeated by collinearity among the predictors. The multiple
regression equation is given
and the significance of each component parameter is indicated. There are also
options for analysis of variance and interpolation. The analysis of variance
provides a test of independence for the Y variable in comparison with the X
variables. A multiple correlation coefficient is given with the analysis of
variance. A logical extension of multiple linear regression is the selection
of predictor (X, independent) variables. There are a number of methods which
deal with this, for example step-up selection, step-down selection, stepwise
regression and best subset selection. The fact that there is not a
predominantly favoured method means that none of them is really satisfactory
for general use; a good discussion is given by Draper and Smith (ref 23). The
current version
of Arcus provides best subset selection by examination of all possible
regressions. You have the option of two selection criteria, minimum Mallows'
Cp statistic or maximum overall F. You may also force the inclusion of
variables in this selection procedure if you consider their exclusion to be
illogical in "real world" terms (ref 23).
EXAMPLE (from Armitage ref 4 p 300): The following data are from a trial of
a hypotensive drug used to lower blood pressure during surgery. The outcome /
dependent variable (Y) is minutes taken to recover an acceptable (100mmHg)
systolic blood pressure and the two predictor or explanatory variables are,
log dose of drug (X1) and mean systolic blood pressure during the induced
hypotensive episode (X2).
X1 X2 Y
2.26 66 7
1.81 52 10
1.78 72 18
1.54 67 4
2.06 69 10
1.74 71 13
2.56 88 21
2.29 68 12
1.80 59 9
2.32 73 65
2.04 68 20
1.88 58 31
1.18 61 23
2.08 68 22
1.70 69 13
1.74 55 9
1.90 67 50
1.79 67 12
2.11 68 11
1.72 59 8
1.74 68 26
1.60 63 16
2.15 65 23
2.26 72 7
1.65 58 11
1.63 69 8
2.40 70 14
2.70 73 39
1.90 56 28
2.78 83 12
2.27 67 60
1.74 84 10
2.62 68 60
1.80 64 22
1.81 60 21
1.58 62 14
2.41 76 4
1.65 60 27
2.24 60 26
1.70 59 28
2.45 84 15
1.72 66 8
2.37 68 46
2.23 65 24
1.92 69 12
1.99 72 25
1.99 63 45
2.35 56 72
1.80 70 25
2.36 69 28
1.59 60 10
2.10 51 25
1.80 61 44
To analyse these data in Arcus you must first enter them into three columns in
the worksheet appropriately labelled. Then select multiple linear regression
from the regression and correlation menu of the analysis section. Press Esc
when you are asked for the standard deviations of Y, i.e. selecting an
unweighted analysis. Press Y when you are asked whether you want an intercept;
one can rarely find a good enough reason not to have an intercept.
For this example:
Y = 23.01 + 23.639 X1 - 0.715 X2
Intercept b0 = 23.01067 (p = 0.214)
X1 b1 = 23.63856 (p = 0.001)
X2 b2 = - 0.71468 (p = 0.022)
r square = 0.2018
r square adjusted = 0.1699
F = 6.32 (p = 0.001)
The variance ratio, F, for the overall regression is highly significant, thus
we have strong evidence that at least one of X1 and X2 is associated with Y.
The r square value shows that only 20% of the variance of Y is accounted for
by the regression, therefore the predictive value of this model is low. The
partial regression coefficients are shown to be significant but the intercept
is not.
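The fitted equation can be cross-checked with any least squares solver that
uses singular value decomposition; a minimal sketch in Python (assuming numpy,
for illustration only):

  import numpy as np

  x1 = np.array([2.26, 1.81, 1.78, 1.54, 2.06, 1.74, 2.56, 2.29, 1.80, 2.32,
                 2.04, 1.88, 1.18, 2.08, 1.70, 1.74, 1.90, 1.79, 2.11, 1.72,
                 1.74, 1.60, 2.15, 2.26, 1.65, 1.63, 2.40, 2.70, 1.90, 2.78,
                 2.27, 1.74, 2.62, 1.80, 1.81, 1.58, 2.41, 1.65, 2.24, 1.70,
                 2.45, 1.72, 2.37, 2.23, 1.92, 1.99, 1.99, 2.35, 1.80, 2.36,
                 1.59, 2.10, 1.80])
  x2 = np.array([66, 52, 72, 67, 69, 71, 88, 68, 59, 73, 68, 58, 61, 68, 69,
                 55, 67, 67, 68, 59, 68, 63, 65, 72, 58, 69, 70, 73, 56, 83,
                 67, 84, 68, 64, 60, 62, 76, 60, 60, 59, 84, 66, 68, 65, 69,
                 72, 63, 56, 70, 69, 60, 51, 61], dtype=float)
  y = np.array([7, 10, 18, 4, 10, 13, 21, 12, 9, 65, 20, 31, 23, 22, 13, 9,
                50, 12, 11, 8, 26, 16, 23, 7, 11, 8, 14, 39, 28, 12, 60, 10,
                60, 22, 21, 14, 4, 27, 26, 28, 15, 8, 46, 24, 12, 25, 45, 72,
                25, 28, 10, 25, 44], dtype=float)

  X = np.column_stack([np.ones_like(x1), x1, x2])
  coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # solved via SVD
  print(coef)                                   # about [23.01, 23.64, -0.71]
  fitted = X @ coef
  r2 = 1 - ((y - fitted) ** 2).sum() / ((y - y.mean()) ** 2).sum()
  print(r2)                                     # about 0.2018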
Arcus offers more facilities for general linear regression than I have shown
here. Their use requires a reasonable background knowledge of general linear
models and their assumptions, so I shall not discuss them all with examples;
the experienced user will be familiar with them. A good reference is Draper &
Smith ref 23.
In summary, these facilities are:
1. Best subset selection. When you have many predictor variables you can ask
Arcus to select the subset of predictor variables which gives the "best"
fitting model as judged by Mallows' Cp statistic or the overall significance
of the regression. Mallows' Cp is favoured in most situations.
2. XXi matrix. This prints out the XXi, i.e. (X'X) inverse, matrix of the
linear model, from which the hat / projection matrix is derived. Double
precision is displayed as the singular value decomposition of this general
linear regression is performed in double precision.
3. Influential data. This gives an analysis of residuals and allows you to
save the residuals and their associated statistics. It is good practice to
examine a plot of the residuals against Y. You might also wish to have a
normal plot of the residuals, this is available in the pictorial statistics
menu of the Arcus analysis section. Along with the residuals you are given
the standard error of the predicted Y, the leverage Hi (the ith diagonal
element of the Hat matrix), Studentized residuals, Cook's distance,
covariance ratio and DFFITS. Note that Studentized residuals have a t
distribution with n-p-1 degrees of freedom. If Hi is larger than 2p/n then
that observation has unusual predictor values. Unusual predicted as
opposed to predictor values are indicated by large residuals. Cook's
distance and DFFITS combine these factors in an overall measure. Cook's D
can be considered large if it exceeds F (0.50, p, n-p) from the F
distribution. DFFITS is unusually large if it is greater than 2 * SQR(p/n).
Unusual covariance ratios are considered to lie outside the range
1 - 3 * (p/n) to 1 + 3 * (p/n). A good discussion of the analysis of
residuals is given by Belsley et al. ref 32. In this paragraph p = number
of coefficients in the model (including constant) and n = number of
observations.
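All of the diagnostic quantities above can be derived from the design matrix;
a minimal sketch in Python (assuming numpy; the design matrix here is
simulated, purely for illustration, and the internally studentized residuals
shown may differ slightly from an externally studentized version):

  import numpy as np

  rng = np.random.default_rng(0)
  n, p = 30, 3                                  # p counts the constant too
  X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
  y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

  H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat / projection matrix
  h = np.diag(H)                                # leverage Hi; flag Hi > 2p/n
  resid = y - H @ y
  s2 = resid @ resid / (n - p)
  stud = resid / np.sqrt(s2 * (1 - h))          # studentized residuals
  cooks = stud ** 2 * h / (p * (1 - h))         # Cook's distance
  dffits = stud * np.sqrt(h / (1 - h))          # flag |DFFITS| > 2*sqrt(p/n)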
¬<p values>╪29175 ¬
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|Regression in Groups|
¬<Linearity with replicates of Y>╪136133 ¬
¬<Grouped linear regression with covariance analysis>╪139152 ¬
This sub-section provides grouped linear regression and analysis of covariance.
There is also a test for linearity when repeated observations of the Y
(dependent) variable are available for each observation in the X (independent)
variable.
|Linearity with replicates of Y|
The standard analysis of variance for a linear regression tells you about the
significance of the slope but it does not test whether or not you should be
using linear regression in the first place. Here we provide a method which
can be used to test the assumption of linearity.
In important studies which utilise linear regression it is worth collecting
repeat Y observations. This enables you to run a test of linearity and thus
justify or refute the use of linear regression in subsequent analysis of these
data (ref 4). The replicate Y observations should be entered in separate
worksheet columns (variables), one column for each observation (row) in the X
variable. The number of Y replicate variables which you are prompted to
choose is governed by the size of the X variable which you have selected.
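One common form of this partition splits the between-doses sum of squares
into a regression part (1 degree of freedom) and a part for deviations from
linearity (k-2 degrees of freedom), each tested against the within-doses mean
square. A minimal sketch in Python (assuming numpy and scipy; the data are
hypothetical and the exact error term Arcus pools may differ slightly):

  import numpy as np
  from scipy.stats import f as f_dist

  x = np.array([1.0, 2.0, 3.0, 4.0])            # hypothetical dose levels
  reps = [np.array([2.1, 2.5, 1.9]),            # hypothetical replicates of Y
          np.array([3.8, 4.2, 4.0, 4.4]),
          np.array([6.2, 5.8, 6.0]),
          np.array([7.9, 8.3, 8.1, 7.7])]

  n = np.array([len(r) for r in reps])
  m = np.array([r.mean() for r in reps])
  N, k = n.sum(), len(x)
  grand = sum(r.sum() for r in reps) / N

  xbar = (n * x).sum() / N
  sxx = (n * (x - xbar) ** 2).sum()
  sxm = (n * (x - xbar) * (m - grand)).sum()
  ss_reg = sxm ** 2 / sxx                         # due to regression, 1 df
  ss_dev = (n * (m - grand) ** 2).sum() - ss_reg  # deviations from linearity
  ss_within = sum(((r - mu) ** 2).sum() for r, mu in zip(reps, m))

  ms_within = ss_within / (N - k)
  F_dev = (ss_dev / (k - 2)) / ms_within          # large F => doubt linearity
  print(F_dev, f_dist.sf(F_dev, k - 2, N - k))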
EXAMPLE (from Armitage, ref. 4 p268): A preparation of vitamin D is
tested by feeding it to rats with induced osteomalacia and measuring the
subsequent re-mineralisation of their bones using radiographic methods:
Log dose of Vit D ---> 0.544 0.845 1.146
Bone density score --> 0 1.5 2
0 2.5 2.5
1 5 5
2.75 6 4
2.75 4.25 5
1.75 2.75 4
2.75 1.5 2.5
2.25 3 3.5
2.25 3
2.5 2
3
4
4
To analyse these data in Arcus you must first enter them into four columns in
the worksheet appropriately labelled. The first column is just three rows long
and contains the three log doses of vitamin D above. The next three columns
represent the repeated measures of bone density for each of the three levels
of log dose of vitamin D which are represented by the rows of the first column.
Then select the linearity function from the regression in groups sub-menu of the
regression and correlation menu in the analysis section. When you are prompted
for the X variable select the column which contains the three log dose levels.
Then select the three Y columns which correspond to each row (level) of the
X variable i.e. 0.544 --> 0.845 --> 1.146.
For this example:
Due to regression F = 9.45 (p = 0.0047)
Deviations from X means F = 1.95 (p = 0.1738)
Thus the regression itself (meaning the slope) was statistically highly
significant. If the deviations from X means had been significant then we
should have rejected our assumption of linearity; as it stands they were not.
Arcus gives you plain English interpretations of these results directly.
¬<p values>╪29175 ¬
¬<non-linear models>╪156043 ¬
¬<reference list>╪310584 ¬
|Grouped linear regression with covariance analysis|
The grouped regression function enables you to compare regression lines. Again
it is assumed that for each of the X values the corresponding Y values have been
drawn at random from a normal distribution. The method involves examination of
the regression parameters for a group of XY pairs in relation to a common fitted
function. This provides an analysis of variance which shows whether there is
a significant difference between the slopes of the individual regression lines
as a whole. Arcus then compares all of the slopes individually. The vertical
distance between each regression line is then examined using analysis of
covariance and the corrected means are given (ref 4). This is just one facet of
the analysis of covariance and there exist alternative methods. For further
information please consult good references such as Draper & Smith (ref 23) and
Armitage & Berry (ref 4).
EXAMPLE (from Armitage ref. 4 p 277): Three preparations of vitamin D are
tested by feeding them to rats with induced osteomalacia and measuring the
subsequent re-mineralisation of their bones using radiographic methods:
For the standard preparation:
Log dose of Vit D ---> 0.544 0.845 1.146
Bone density score --> 0 1.5 2
0 2.5 2.5
1 5 5
2.75 6 4
2.75 4.25 5
1.75 2.75 4
2.75 1.5 2.5
2.25 3 3.5
2.25 3
2.5 2
3
4
4
For alternative preparation I:
Log dose of Vit D ---> 0.398 0.699 1.000 1.301 1.602
Bone density score --> 0 1 1.5 3 3.5
1 1.5 1 3 3.5
0 1.5 2 5.5 4.5
0 1 3.5 2.5 3.5
0 1 2 1 3.5
0.50 0.5 0 2 3
For alternative preparation F:
Log dose of Vit D ---> 0.398 0.699 1.000
Bone density score --> 2.75 2.5 3.75
2 2.75 5.25
1.25 2.25 6
2 2.25 5.5
0 3.75 2.25
0.5 3.5
To analyse these data in Arcus you must first enter them into 14 columns in
the worksheet appropriately labelled. The first column is just three rows long
and contains the three log doses of vitamin D for the standard preparation.
The next three columns represent the repeated measures of bone density for each
of the three levels of log dose of vitamin D which are represented by the rows
of the first column. This is then repeated for the other two preparations.
Then select the grouped linear regression function from the regression in groups
sub-menu of the regression and correlation menu in the analysis section. Enter
3 as the number of XY pairs and select Y when asked if you wish to use
replicates. When you are prompted for the first X variable select the column
which contains the three log dose levels for the standard preparation. Then
select the three Y columns which correspond to each row (level) of the X
variable for the standard preparation i.e. 0.544 --> 0.845 --> 1.146.
Alternatively these data could have been entered in just three pairs of
worksheet columns representing the three preparations with a log dose column
and column of the mean bone density score for each dose level. By accepting
the more long-winded input of replicates Arcus is encouraging you to run a
test of linearity on your data.
For this example:
common slope p = < 0.0001
between slopes p = 0.1510
slope comparisons: standard vs I p = 0.4195
standard vs F p = 0.0379
I vs F p = 0.0325
corrected covariance analysis:
F = 1.69 (p = 0.2510)
vertical separations: standard vs I p = 0.3070
standard vs F p = 0.4345
I vs F p = 0.2493
The common slope is highly significant and the overall test for difference
between the slopes was non-significant. Provided that our assumption of
linearity holds true we can conclude that these lines are reasonably parallel.
Looking more closely at the individual slopes, preparation F is shown to be
significantly different from the other two, but this difference was not large
enough to make the overall slope comparison significantly heterogeneous.
The analysis of covariance did not show any significant vertical separation of
the three regression lines.
¬<p values>╪29175 ¬
¬<reference list>╪310584 ¬
|Polynomial Regression|
If you have reason to believe that a polynomial model is appropriate to your
data then you can use this function to construct one. You supply the number
of degrees (order) of the polynomial and Arcus gives you the coefficient for
each degree of the equation together with the constant. Subjective goodness of
fit may be assessed by plotting the data and the fitted curve. Try to use as
few degrees as possible for a model which achieves significance at each degree.
Regression is by singular value decomposition (ref 23, 14). An analysis of
variance is given via the analysis option. There is also an option which
calculates the area under the curve. The polynomial function which has been
fitted is integrated from the lowest to the highest X value using Romberg's
method to give an area under the fitted curve. The trapezoidal rule is also
used directly on the vector to give another estimate of the area under the
curve. The plot function supplies visual confidence and prediction intervals
but you can save the predicted Y values with their errors and intervals by
selecting option [6].
If you require more detail from the regression, such as an analysis of the
residuals, then you should use the multiple linear regression option. To
achieve a polynomial fit using multiple linear regression you must first
create new worksheet columns which contain the X variable raised to powers
up to the degree you want. For example, a second order fit requires Y,
X and X * X.
EXAMPLE (from Statistics ref 34 p 753): Here we will use a non-biomedical
example to emphasise the point that polynomial regression is more often
applicable to data from the physical sciences where variables are more
controllable. Below are the electricity consumption data in kilowatt hours
per month from ten houses and the areas in square feet of these houses:
House area KW-hours per month
1290 1182
1350 1172
1470 1264
1600 1493
1710 1571
1840 1711
1980 1804
2230 1840
2400 1956
2930 1954
To analyse these data in Arcus you must first prepare them in two worksheet
columns appropriately labelled. Then select polynomial regression from the
regression and correlation menu of the analysis section. The X (independent)
variable is house area and the Y (dependent) variable is KW-hours per month.
Enter the order of this polynomial as 2.
For this example:
KW-hours = -1216.14389 + 2.39893 * area - 0.00045 * area * area
F = 189.71 (p < 0.0001)
Root MSE = 46.801
R sqr = 0.9819
for intercept p = 0.0016
X p < 0.0001
X*X p = 0.0001
Thus the overall regression and both degree coefficients are highly significant.
NB Look at a plot of this data curve. The right hand end point shows a very
sharp decline. If you were to extrapolate beyond the data you have observed
then you might conclude that very large houses have a very low electricity
consumption. This is obviously ludicrous. Polynomials are often well out
of line with common sense in parts of the curve but seem to fit other parts
well. You must blend common sense, art and mathematics when fitting these
models! Remember: a) your model will be much more reliable if it is
built around large numbers of observations; b) do not extrapolate beyond
your observations; c) choose numbers for X which are not too large as they
will cause overflow with higher degree polynomials; d) do not draw false
confidence from low p values, only use these to support your model if the
plot looks good!
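The fit and both area estimates can be cross-checked; a minimal sketch in
Python (assuming numpy, for illustration only):

  import numpy as np

  area = np.array([1290, 1350, 1470, 1600, 1710, 1840, 1980, 2230, 2400,
                   2930], dtype=float)
  kwh = np.array([1182, 1172, 1264, 1493, 1571, 1711, 1804, 1840, 1956,
                  1954], dtype=float)

  coef = np.polyfit(area, kwh, 2)        # coefficients, highest power first
  print(coef)                            # about [-0.00045, 2.399, -1216.1]
  fit = np.poly1d(coef)

  xs = np.linspace(area.min(), area.max(), 1000)
  print(np.trapz(fit(xs), xs))           # area under the fitted curve
  print(np.trapz(kwh, area))             # trapezoidal rule on the raw data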
¬<p values>╪29175 ¬
¬<non-linear models>╪156043 ¬
¬<reference list>╪310584 ¬
|Linearized Estimates|
This section provides regression estimates for three linearised functions by
an unweighted least squares method. This approach is far from ideal and should
be used only to indicate that a more robust fit of the selected model might be
appropriate for your data. Exponential, geometric and hyperbolic approximations
are offered.
For the exponential model the data are linearized by log transformation of
the dependent variable and the linear regression gives you A and B for the
function Y = A * exp(B * X).
For the geometric method the natural logarithms of both variables are
linearly regressed for Y = A * (X ^B).
The hyperbolic method uses the reciprocals of both variables to calculate A and
B for Y = X / (A + B * X).
The standard error of the estimate is given for each of these regressions
but please note that the errors of your dependent / response variable might
not be from a normal distribution.
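The three linearizations are simple to reproduce; a minimal sketch in Python
(assuming numpy and scipy; the data are hypothetical, purely for
illustration):

  import numpy as np
  from scipy import stats

  x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])    # hypothetical, positive
  y = np.array([2.6, 3.9, 6.1, 9.2, 13.8, 21.1])  # hypothetical, positive

  # Exponential Y = A * exp(B * X): regress ln(Y) on X.
  r = stats.linregress(x, np.log(y))
  A_exp, B_exp = np.exp(r.intercept), r.slope

  # Geometric Y = A * X^B: regress ln(Y) on ln(X).
  r = stats.linregress(np.log(x), np.log(y))
  A_geo, B_geo = np.exp(r.intercept), r.slope

  # Hyperbolic Y = X / (A + B * X): 1/Y = A * (1/X) + B.
  r = stats.linregress(1.0 / x, 1.0 / y)
  A_hyp, B_hyp = r.slope, r.intercept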
This section of Arcus is only intended for those who are familiar with
regression modelling and who use these linearized estimates as a springboard
for further modelling. For these reasons we will not work through an example
here. For generalized linear modelling I recommend the products of The Numerical
Algorithms Group and Rothamsted Experimental Station; these are GLIM and
Genstat. For non-linear modelling I recommend MLP and Genstat. For information
on all of these products contact NAG on UK (0)865 511 245.
¬<p values>╪29175 ¬
¬<confidence intervals>╪31897 ¬
¬<non-linear models>╪156043 ¬
¬<reference list>╪310584 ¬
|Probit Analysis|
When biological responses are plotted against their causal stimuli (or
logarithms of them) they often describe a sigmoid curve. Methods have been
developed which linearize this relationship so that they are easier to deal
with numerically. This linearization can be achieved using a number of
transformations including logit, probit and angular. For most systems the
probit (normal sigmoid) and logit (logistic sigmoid) give the most closely
fitting result. Logistic methods are also useful in Epidemiology because
odds ratios can be determined easily from differences between fitted logits.
In biological assay, however, probit analysis is preferable (ref 18, 19).
Curves produced by these methods are very similar, with maximum variation
occurring within 10% of the upper and lower asymptotes. Historically some
workers have used logistic regression because it is easier to calculate than
probit analysis; this is no longer true with the aid of computers.
Probit analysis has been added to Arcus to provide dose/stimulus - response
curve fitting. Your data are entered as dose levels, number of subjects tested
at each dose level and number responding at each dose level. You are also
given the opportunity to enter a control result for the number of subjects
responding in the absence of dose/stimulus - this provides a global adjustment
for natural mortality/responsiveness. You are also asked whether you want log
transformation of the dose levels or not. The curve is then fitted by Newton
-Raphson iteration. The quality of the resultant curve is assessed by
statistics for heterogeneity which follow a chi-square distribution. If these
are significant then your observed values deviate from the fitted curve too
much for reliable inference to be made from that curve (ref 18, 19). Arcus
gives you the effective/lethal levels of dose/stimulus with confidence intervals
at the quantiles you specify. The fitted curve can be plotted and printed.
If you require more complex probit analysis, such as the calculation of
relative potencies from several related dose response curves, then you should
consider using non-linear optimization software or specialist dose-response
analysis software such as Bliss. The latter is a FORTRAN routine written by
David Finney and Ian Craigie; it is available from Edinburgh University
Computing Centre. If you are considering using Bliss then you must be familiar
with FORTRAN and the basic principles of probit analysis (ref 18, 19). For more
general non-linear model fitting with the ability to constrain curves to
"parallelism" then I advise you to use MLP or Genstat. At this point most
people should seek statistical help. More information is available under the
notes on ¬non-linear models╪156043 ¬.
CAUTION: Please do not think of probit analysis as a "cure all" for dose
response curves. Many log dose - response relationships are clearly not
Gaussian sigmoids. They may not be any of the other sigmoids either, e.g.
angular, Wilson-Worcester or Cauchy-Urban. You may not be able to use a
regression model "off the shelf". This brings us to the complex subject of
non-linear modelling. At this point most people should seek statistical help.
Please refer to the notes on ¬non-linear models╪156043 ¬.
CAUTION 2: Please remember that this form of probit analysis is designed to
handle only quantal responses with binomial error distributions. Quantal data,
such as the number of subjects responding vs total number of subjects tested,
usually have binomial error distributions. You must NOT use continuous data,
such as % maximal response, with probit analysis as these data require
regression methods which assume a different error distribution. Again, at this
point most people should seek statistical help. Please refer to the notes on
¬non-linear models╪156043 ¬.
EXAMPLE (from Finney ref 18 p 98): The following data represent a study of the
age at menarche of 3918 Warsaw girls. For each age group you are given mean
age, total number of girls and the number of girls who had reached menarche.
Age Girls + Menses
9.21 376 0
10.21 200 0
10.58 93 0
10.83 120 2
11.08 90 2
11.33 88 5
11.58 105 10
11.83 111 17
12.08 100 16
12.33 93 29
12.58 100 39
12.83 108 51
13.08 99 47
13.33 106 67
13.58 105 81
13.83 117 88
14.08 98 79
14.33 97 90
14.58 120 113
14.83 102 95
15.08 122 117
15.33 111 107
15.58 94 92
15.83 114 112
17.58 1049 1049
To analyse these data in Arcus you must first prepare them in three worksheet
columns appropriately labelled. Then select probit analysis from the regression
and correlation menu of the analysis section. "Dose" levels here are the mean
ages, number in each group are the number of girls and number responding are
the number + menses. Select probit as the sigmoid model. Then select a 95%
confidence interval by pressing the enter key when you see the confidence
interval menu. Select N when asked whether or not you require logarithmic
conversion of the independent variable (mean ages).
For this example:
Y = -6.8189 + 0.9078 X in probits
heterogeneity of deviations from model p = 0.5262
ED50:
The estimated median age at menarche = 13.02 (95% CI = 12.94 to 13.09)
Having looked at a plot of this model and accepted it as appropriate we can
conclude with 95% confidence that the true population value for median age at
menarche in Warsaw lay between 12.94 and 13.09 years when this study was done.
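The fitted line is easy to approximate by direct maximum likelihood; a
minimal sketch in Python (assuming numpy and scipy). This is not the
Newton-Raphson routine used by Arcus, and it works on the Z scale; classical
probits add 5, so expect the intercept to be about 5 less than in the Arcus
equation above:

  import numpy as np
  from scipy.stats import norm
  from scipy.optimize import minimize

  age = np.array([9.21, 10.21, 10.58, 10.83, 11.08, 11.33, 11.58, 11.83,
                  12.08, 12.33, 12.58, 12.83, 13.08, 13.33, 13.58, 13.83,
                  14.08, 14.33, 14.58, 14.83, 15.08, 15.33, 15.58, 15.83,
                  17.58])
  total = np.array([376, 200, 93, 120, 90, 88, 105, 111, 100, 93, 100, 108,
                    99, 106, 105, 117, 98, 97, 120, 102, 122, 111, 94, 114,
                    1049], dtype=float)
  resp = np.array([0, 0, 0, 2, 2, 5, 10, 17, 16, 29, 39, 51, 47, 67, 81, 88,
                   79, 90, 113, 95, 117, 107, 92, 112, 1049], dtype=float)

  def nll(params):                      # binomial negative log likelihood
      a, b = params
      p = np.clip(norm.cdf(a + b * age), 1e-10, 1 - 1e-10)
      return -(resp * np.log(p) + (total - resp) * np.log(1 - p)).sum()

  fit = minimize(nll, x0=[-11.0, 0.9], method="Nelder-Mead")
  a, b = fit.x
  print(a + 5, b)   # in classical probits: about -6.82 and 0.908
  print(-a / b)     # ED50, the median age at menarche: about 13.02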
¬<p values>╪29175 ¬
¬<confidence intervals>╪31897 ¬
¬<non-linear models>╪156043 ¬
¬<reference list>╪310584 ¬
|Non-Linear Models|
Biomedical research reveals many relationships which are inherently non-linear.
One way of dealing with this is to transform variables so that the relationship
between them approximates linearity. This works well in many cases but is not
possible in others.
One of the greatest problems you face when fitting transformed variables is that
errors you assumed to be normal in the non-transformed variable become
non-normal after transformation. In specific cases such as the probit analysis
in Arcus, this has been anticipated and the error calculations have been
designed to cope with the expected error distribution. It is not advisable to
feed transformed variables through linear regression. If you are confident of
a particular model then you are justified in using a generalised linear model
method to fit your data. Examples of this are probit analysis and logistic
regression. Please note that the current version of Arcus Pro-Stat does not
offer multiple logistic regression. A multiple logistic regression module is
under development for the next version of Arcus Pro-Stat. SAS, Genstat and
GLIM have logistic regression functions.
If you need to develop a non-linear model for your data then you MUST know what
you are doing. This is a highly complex area which blends gut feeling, art and
science. Please seek statistical advice if you want to build non-linear models.
It is not the place of Arcus to cover this large and highly specialised field;
you should seek out a well validated non-linear estimation package that is
supported by experts in the field. The only such packages I have found are
MLP and Genstat. The former is a dedicated non-linear estimation package of
academic excellence from Gavin Ross at Rothamsted Experimental Station. He is
widely published in this field and in my opinion both MLP and his book (ref 34)
represent the state of the art in practical non-linear modelling. Genstat
is a general stats package which includes many of the functions of MLP because
it also comes from Rothamsted. Genstat is not as easy to use as Arcus but it
covers a number of specialist areas which Arcus does not. I would recommend
Genstat as a good partner to Arcus.
Nota Bene!! PLEASE BEWARE OF PACKAGES WHICH CLAIM TO BE "BLACK BOXES" FOR
NON-LINEAR MODELLING, THIS IS NOT POSSIBLE AT PRESENT (1994).
For more information on GenStat, MLP or GLIM please contact the Numerical
Algorithms Group on UK (0)865 511 245.
|Analysis of Variance|
¬<One way>╪162596 ¬
¬<Two way>╪164960 ¬
¬<Two way with replicates>╪167988 ¬
¬<Crossover>╪171202 ¬
¬<Kruskal-Wallis>╪174413 ¬
¬<Friedman>╪177293 ¬
Analysis of variance (ANOVA) represents a group of methods for investigating
how the means of variables are affected by the way in which those variables are
classified. In practical terms, you can test for an overall difference between
the population means for a group of samples within the constraints of a given
experimental design. Arcus then allows you to make individual comparisons
between each of the groups using methods which have been designed for the
multiple comparison or simultaneous inference situation. When multiple
comparisons are made you are in danger of type I error when using t tests alone,
thus, more conservative approaches are required. Arcus offers you the methods
due to Scheffé, Newman-Keuls and gives Bonferroni's limitation with the t tests
(ref 4, 13, 22). With the Newman-Keuls method, means are first ordered in
sequence then each possible discrete comparison is made. The probability
associated with the resultant q values are then derived from the Studentized
range. For Scheffé's test all possible linear contrasts are also made
automatically. Please note that this is a controversial area in statistics and
you would be wise to seek the advice of a statistician before you design your
study. In general you should design experiments so that you can avoid having
to "dredge" groups of data for differences, decide which contrasts you are
interested in at the outset. An excellent account of ANOVA is given by
Armitage & Berry (ref 4). The nonparametric alternatives to ANOVA are also
covered in this section.
BEYOND ARCUS:
If each treatment/exposure factor in your design contains sub-factors of
treatment/exposure groups then you should consider a nested hierarchical
analysis of variance. This design is not covered by the present version of
Arcus, SAS gives a reasonably good implementation of it.
Hospital 1 Hospital 2
* *
ward 1 ward 2 ward 3 ward 1 ward 2 ward 3
x x x x x x <--- patients
x x x x x x
x x x x x x
x x x x x
x x x
x x
If your design represents repeated exposures/treatments for two different
categorisations then you should consider a Latin square design. An example of
this is the response of 5 different rats (factor 1) to 5 different treatments
(repeated blocks) when housed in 5 different types of cage (factor 2).
Rat       1   2   3   4   5
Cage
1         A   B   C   D   E
2         B   C   D   E   A
3         C   D   E   A   B
4         D   E   A   B   C
5         E   A   B   C   D
For designs with complete missing blocks you should consider a balanced
incomplete block design provided the number of missing blocks does not exceed
the number of treatments.
Block 1 2 3 4
Treatment A x x x
B x x x
C x x x
D x x x
If all factor levels in a design are of intrinsic interest rather than some
form of randomised blocking then you should consider a factorial design.
Factorial ANOVA can combine levels into treatments; a simple application of
this is the crossover ANOVA which is offered by Arcus. More complex factorial
designs require careful planning and I would advise you to seek statistical
advice at this stage.
These ANOVA designs are not covered by the current version of Arcus. SAS
offers a range of complex ANOVAs and BMDP covers most.
|One Way|
Imagine you have four groups of data which represent one experiment performed
on four different occasions with ten different subjects each time. You can
test the consistency of the experimental conditions or the inherent error of
the experiment using a one way analysis of variance. This assumes that each
group comes from an approximately normal distribution and that the variability
within the groups is roughly constant. The factors are arranged so that
experiments are columns and subjects are rows, this is how you must enter your
data in the Arcus worksheet. The F test is fairly robust to small deviations
from these assumptions but you could use the ¬Kruskal-Wallis╪174413 ¬ test if there was
any doubt. A significant test indicates a difference between the population
means for the groups as a whole. You may then go on to make ¬multiple contrasts╪180310 ¬
between the groups but this "dredging" should be avoided if possible. If the
groups in this example had been a series of treatments / exposures to which
subjects (blocks) were randomly allocated then a two way randomised block design
ANOVA should have been used.
EXAMPLE (from Armitage ref 4 p 193):
The following data represent the numbers of worms isolated from the GI tracts
of four groups of rats in a trial of carbon tetrachloride as an anthelminthic.
These four groups were the control (untreated) groups:
Expt 1 Expt 2 Expt 3 Expt 4
279 378 172 381
338 275 335 346
334 412 335 340
198 265 282 471
303 286 250 318
To analyse these data in Arcus you must first prepare them in four worksheet
columns appropriately labelled. Then select the one way function from the
analysis of variance menu of the analysis section. Enter the number of groups
as four.
For this example:
F = 2.27 (p = 0.1195)
The null hypothesis that there is no difference in mean worm counts across the
four groups is therefore retained. If we had rejected this null hypothesis then we would
have had to take a close look at the experimental conditions to make sure that
all control groups were exposed to the same conditions.
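A minimal cross-check in Python (assuming scipy, for illustration only):

  from scipy import stats

  e1 = [279, 338, 334, 198, 303]
  e2 = [378, 275, 412, 265, 286]
  e3 = [172, 335, 335, 282, 250]
  e4 = [381, 346, 340, 471, 318]

  F, p = stats.f_oneway(e1, e2, e3, e4)
  print(F, p)   # about 2.27 and 0.12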
¬<p values>╪29175 ¬
¬<multiple contrasts>╪180310 ¬
¬<analysis of variance>╪158578 ¬
¬<reference list>╪310584 ¬
|Two Way|
If your data are classified simultaneously by two factors such that each level
of one factor can be combined with all levels of the other factor then a two way
ANOVA might be appropriate. If one of these factors represents treatments/
exposures and the other represents experimental subjects which have been
randomly allocated to each of these treatments then you are justified in using
a randomised block design. The factors are arranged so that treatments are
columns and subjects are rows, this is how you must enter your data in the Arcus
worksheet. The warnings above concerning multiple comparison methods apply here
also. There is no special provision for substitution of missing data in the
simple two way ANOVA, a row containing a missing value is simply left out of
the analysis.
If you wish to use a two way ANOVA but your data are clearly non-normal then
you should consider the nonparametric alternative due to Milton ¬Friedman╪177293 ¬.
EXAMPLE (from Armitage ref 4 p 218):
The following data represent clotting times (mins) of plasma from eight subjects
treated in four different ways. The eight subjects (blocks) were allocated at
random to each of the four treatment groups:
Treatment 1 Treatment 2 Treatment 3 Treatment 4
8.4 9.4 9.8 12.2
12.8 15.2 12.9 14.4
9.6 9.1 11.2 9.8
9.8 8.8 9.9 12
8.4 8.2 8.5 8.5
8.6 9.9 9.8 10.9
8.9 9 9.2 10.4
7.9 8.1 8.2 10
To analyse these data in Arcus you must first prepare them in four worksheet
columns appropriately labelled. Then select two way from the analysis of
variance menu of the analysis section. Enter the number of groups as four.
For this example:
F (VR between subjects) = 17.2042 P < 0.0001 ***
F (VR between groups) = 6.61503 P = 0.0025 **
Newman-Keuls Multiple Comparisons
Treatment 4 vs Treatment 3 Q = 3.798024 P = 0.0140 *
Treatment 4 vs Treatment 2 Q = 4.583823 P = 0.0106 *
Treatment 4 vs Treatment 1 Q = 6.024452 P = 0.0020 **
Treatment 3 vs Treatment 2 Q = .7857996 P = 0.4155
Treatment 3 vs Treatment 1 Q = 2.226428 P = 0.2785
Treatment 2 vs Treatment 1 Q = 1.440628 P = 0.3201
Here we can see that there was a statistically highly significant difference
between mean clotting times across the groups. The difference between
subjects is of no particular interest here. The ¬multiple contrasts╪180310 ¬ show us
that the mean clotting time for group four is statistically significantly
different from the other three which are not significantly separated from
each other.
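The two F ratios can be cross-checked from the classical sums of squares; a
minimal sketch in Python (assuming numpy and scipy, for illustration only):

  import numpy as np
  from scipy.stats import f as f_dist

  # Rows are subjects (blocks), columns are treatments 1 to 4.
  y = np.array([[8.4, 9.4, 9.8, 12.2],
                [12.8, 15.2, 12.9, 14.4],
                [9.6, 9.1, 11.2, 9.8],
                [9.8, 8.8, 9.9, 12.0],
                [8.4, 8.2, 8.5, 8.5],
                [8.6, 9.9, 9.8, 10.9],
                [8.9, 9.0, 9.2, 10.4],
                [7.9, 8.1, 8.2, 10.0]])
  b, k = y.shape
  grand = y.mean()
  ss_subj = k * ((y.mean(axis=1) - grand) ** 2).sum()
  ss_grp = b * ((y.mean(axis=0) - grand) ** 2).sum()
  ss_res = ((y - grand) ** 2).sum() - ss_subj - ss_grp
  df_res = (b - 1) * (k - 1)
  for name, ss, df in (("subjects", ss_subj, b - 1), ("groups", ss_grp, k - 1)):
      F = (ss / df) / (ss_res / df_res)
      print(name, F, f_dist.sf(F, df, df_res))  # about 17.20 and 6.62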
¬<p values>╪29175 ¬
¬<multiple contrasts>╪180310 ¬
¬<analysis of variance>╪158578 ¬
¬<reference list>╪310584 ¬
|Two Way with Replicates|
The simple two way randomised block design assumes that the row (subject) and
column (group) effects are additive. This means that apart from experimental
error, the difference in effect between any two rows is the same for all columns
and vice versa. If these effects are not additive then there exists a row
-column interaction which must be investigated by repeating the observations
for each block. These data can then be analysed using this two way randomised
block design ANOVA for repeated observations. Arcus will compensate for missing
observations in the replicates by estimating them as the mean of the replicates
present and by reducing the degrees of freedom; you should avoid this situation
if possible. Enter each set of replicates in a separate worksheet column so
that there is a different Arcus variable for each cell of the two way table;
the replicates form a third dimension, coming out of the page, which is as
deep as the number of rows for these data in the worksheet.
EXAMPLE (from Armitage ref 4 p 221):
The following data represent clotting times (mins) from three subjects treated
in three different ways. The plasma samples were allocated randomly to the
treatments and the analysis was repeated three times for each sample.
Treatment A B C
Subject 1 9.8 9.9 11.3
10.1 9.5 10.7
9.8 10 10.7
Subject 2 9.2 9.1 10.3
8.6 9.1 10.7
9.2 9.4 10.2
Subject 3 8.4 8.6 9.8
7.9 8 10.1
8 8 10.1
To analyse these data in Arcus you must first prepare them in nine worksheet
columns:
s = subject
t = treatment
s1t1 s1t2 s1t3 s2t1 s2t2 s2t3 s3t1 s3t2 s3t3
9.8 9.9 11.3 9.2 9.1 10.3 8.4 8.6 9.8
10.1 9.5 10.7 8.6 9.1 10.7 7.9 8 10.1
9.8 10 10.7 9.2 9.4 10.2 8 8 10.1
Next select the two way with replicates option from the analysis of variance
menu of the analysis section. Enter the number of groups as three and the
number of subjects as three.
For this example:
F (VR Subjects) = 63.13918 P < 0.0001 ***
F (VR Groups) = 80.32172 P < 0.0001 ***
F (VR Interaction) = 2.522677 P = 0.1082
Newman-Keuls Multiple Comparisons
Group 3 vs Group 2 Q = 26.22421 P = 0.0002 ***
Group 3 vs Group 1 Q = 27.50345 P = 0.0001 ***
Group 2 vs Group 1 Q = 1.279235 P = 0.3778
Here we see a statistically highly significant difference between mean clotting
times across the groups and more specifically, group 3 stands out from the rest.
If the F value for interaction had been significant then there would have been
little point in drawing conclusions about independent group and subject effects
from the other F values.
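The same partition with an interaction term; a minimal sketch in Python
(assuming numpy and scipy, for illustration only):

  import numpy as np
  from scipy.stats import f as f_dist

  # y[subject, treatment, replicate]
  y = np.array([[[9.8, 10.1, 9.8], [9.9, 9.5, 10.0], [11.3, 10.7, 10.7]],
                [[9.2, 8.6, 9.2], [9.1, 9.1, 9.4], [10.3, 10.7, 10.2]],
                [[8.4, 7.9, 8.0], [8.6, 8.0, 8.0], [9.8, 10.1, 10.1]]])
  s, t, r = y.shape
  grand = y.mean()
  ss_subj = t * r * ((y.mean(axis=(1, 2)) - grand) ** 2).sum()
  ss_grp = s * r * ((y.mean(axis=(0, 2)) - grand) ** 2).sum()
  cell = y.mean(axis=2)
  ss_inter = r * ((cell - grand) ** 2).sum() - ss_subj - ss_grp
  ss_err = ((y - cell[:, :, None]) ** 2).sum()
  df_e = s * t * (r - 1)
  ms_e = ss_err / df_e
  for name, ss, df in (("subjects", ss_subj, s - 1),
                       ("groups", ss_grp, t - 1),
                       ("interaction", ss_inter, (s - 1) * (t - 1))):
      F = (ss / df) / ms_e
      print(name, F, f_dist.sf(F, df, df_e))  # about 63.1, 80.3 and 2.5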
¬<p values>╪29175 ¬
¬<multiple contrasts>╪180310 ¬
¬<analysis of variance>╪158578 ¬
¬<reference list>╪310584 ¬
|Crossover|
If a group of subjects is exposed to two different treatments A and B then a
crossover trial would involve half of the subjects being exposed to A then B and
the other half to B then A. A washout period is allowed between the two
exposures and the subjects are randomly allocated to one of the two orders of
exposure. A simple crossover design ANOVA can be applied to these data. The
two times when the groups are exposed to the treatments are known as period 1
and period 2. This ANOVA tests for treatment effects, period effects and
treatment-period interaction. For further information please refer to Armitage
& Berry (ref 4).
EXAMPLE (from Armitage ref 4 p224):
The following data represent the number of dry nights out of 14 in two groups
of bedwetters. The first group were treated with drug X and then a placebo
and the second group were treated with the placebo then drug X. An acceptable
washout period was allowed between these two treatments.
Group I: Drug X Placebo Group II: Placebo Drug X
8 5 12 11
14 10 6 8
8 0 13 9
9 7 8 8
11 6 8 9
3 5 4 8
6 0 8 14
0 0 2 4
13 12 8 13
10 2 9 7
7 5 7 10
13 13 7 6
8 10
7 7
9 0
10 6
2 2
To analyse these data in Arcus you must first prepare them in four worksheet
columns appropriately labelled. Then select crossover from the analysis of
variance menu of the analysis section. When asked for baseline levels just
press Esc for none. Select a 95% confidence interval by pressing the enter
key when prompted by the confidence interval menu.
For this example:
Test for relative effectiveness of drug / placebo:
t = 3.526533 P = 0.0007 ***
Test for treatment effect:
diff 1 - diff 2 = 4.073529 SE = 1.2372
effect magnitude = 2.036765 95% CI = .7679056 to 3.305624
t = 3.292539 DF = 27 P = 0.0014 **
Test for period effect:
t = 1.271847 P = 0.1071
Test for treatment / period interaction:
t = -1.299673 P = 0.1024
Here the absence of a statistically significant period effect or treatment
period interaction enables us to quote the statistically highly significant
effect of drug vs placebo. With 95% confidence we can say that the true
population value for the magnitude of the treatment effect lies somewhere
between 0.77 and 3.31 extra dry nights each fortnight.
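The crossover tests reduce to t tests on within-subject differences; a
minimal sketch in Python (assuming numpy and scipy). Note that scipy returns
two sided p values, whereas the Arcus p values above appear to be one sided,
so halve scipy's p to compare:

  import numpy as np
  from scipy import stats

  g1_drug = np.array([8, 14, 8, 9, 11, 3, 6, 0, 13, 10, 7, 13, 8, 7, 9, 10,
                      2], dtype=float)
  g1_plac = np.array([5, 10, 0, 7, 6, 5, 0, 0, 12, 2, 5, 13, 10, 7, 0, 6,
                      2], dtype=float)
  g2_plac = np.array([12, 6, 13, 8, 8, 4, 8, 2, 8, 9, 7, 7], dtype=float)
  g2_drug = np.array([11, 8, 9, 8, 9, 8, 14, 4, 13, 7, 10, 6], dtype=float)

  d1 = g1_drug - g1_plac   # period 1 minus period 2, group I
  d2 = g2_plac - g2_drug   # period 1 minus period 2, group II

  # Treatment effect: two sample t test on the period differences; the
  # effect magnitude is half the difference between the mean differences.
  t, p = stats.ttest_ind(d1, d2)
  print(t, (d1.mean() - d2.mean()) / 2)   # about 3.29 and 2.04

  print(stats.ttest_ind(d1, -d2))         # period effect
  print(stats.ttest_ind(g1_drug + g1_plac, g2_plac + g2_drug))  # interaction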
¬<p values>╪29175 ¬
¬<analysis of variance>╪158578 ¬
¬<reference list>╪310584 ¬
|Kruskal-Wallis| test
This is a method for comparing k independent random samples and can be used as
a nonparametric alternative to the one way ANOVA. In addition to independence
within the samples there must be mutual independence between the samples. The
data must also have been measured using a scale which is at least ordinal. If
the test is significant then you may conclude that at least one of the samples
tends to yield larger observations than at least one of the others. In the
presence of tied ranks the test statistic is given in adjusted and unadjusted
forms, (opinion varies concerning the handling of ties). Approximate
probability is evaluated from a chi-square distribution with k-1 degrees of
freedom. For small samples you may wish to refer to tables of the Kruskal-
Wallis test statistic but the chi-square approximation is highly satisfactory
in most cases. If this test achieves significance you are given the chance to
make multiple comparisons between the samples. You may choose the level of
significance for these comparisons but this is usually α = 0.05 which is the
default on pressing the enter key. All possible comparisons are made and the
probability of each presumed "non-difference" is indicated. For further
information about this method please refer to Conover (ref 6).
EXAMPLE (from Conover ref 6 p 231):
The following data represent corn yields per acre from four different fields
where different farming methods were used.
Method 1 Method 2 Method 3 Method 4
83 91 101 78
91 90 100 82
94 81 91 81
89 83 93 77
89 84 96 79
96 83 95 81
91 88 94 80
92 91 81
90 89
84
To analyse these data in Arcus you must first prepare them in four worksheet
columns appropriately labelled. Then select Kruskal-Wallis from the analysis
of variance menu of the analysis section. Enter the number of groups as four.
For this example:
Adjusted for ties: T = 25.62883 P < 0.0001 ***
Method 1 and Method 2 P = 0.0078 **
Method 1 and Method 3 P = 0.0044 **
Method 1 and Method 4 P < 0.0001 ***
Method 2 and Method 3 P < 0.0001 ***
Method 2 and Method 4 P = 0.0001 ***
Method 3 and Method 4 P < 0.0001 ***
From the overall T we see a statistically highly significant tendency for at
least one group to give higher values than at least one of the others.
Subsequent contrasts show a significant separation of all groups.
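A minimal cross-check in Python (assuming scipy, whose kruskal function
applies the tie adjustment):

  from scipy import stats

  m1 = [83, 91, 94, 89, 89, 96, 91, 92, 90, 84]
  m2 = [91, 90, 81, 83, 84, 83, 88, 91, 89]
  m3 = [101, 100, 91, 93, 96, 95, 94]
  m4 = [78, 82, 81, 77, 79, 81, 80, 81]

  T, p = stats.kruskal(m1, m2, m3, m4)
  print(T, p)   # about 25.63, p < 0.0001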
¬<p values>╪29175 ¬
¬<analysis of variance>╪158578 ¬
¬<reference list>╪310584 ¬
|Friedman| Test
This method compares several related samples and can be used as a nonparametric
alternative to the two way ANOVA. It is assumed that the results within one
block do not influence the results within other blocks. If the test is
significant then at least one of the treatments tends to yield larger
observations than at least one of the other treatments. The power of this
method is low with small samples but it is the best method for nonparametric two
way analysis of variance with sample sizes above five. When the test is
significant Arcus allows you to make multiple comparisons between the individual
samples. These comparisons are performed automatically for all possible
contrasts and you are informed of the statistical significance of each contrast.
Please note that the overall test statistic is T2 as defined by Iman and
Davenport (1980) and this is tested against the F distribution. Older
literature advocates the use of T3 tested against the chi-square distribution
but this has been shown to be an inferior approach. For further information
please refer to Conover (ref 6).
EXAMPLE (from Conover ref 6 p 301):
The following data represent the rank preferences of twelve home owners for
four different types of grass planted in their gardens for a trial period.
They considered defined criteria before ranking each grass between 1 (best)
and 4 (worst).
Grass 1 Grass 2 Grass 3 Grass 4
4 3 2 1
4 2 3 1
3 1.5 1.5 4
3 1 2 4
4 2 1 3
2 2 2 4
1 3 2 4
2 4 1 3
3.5 1 2 3.5
4 1 3 2
4 2 3 1
3.5 1 2 3.5
To analyse these data in Arcus you must first prepare them in four worksheet
columns appropriately labelled. Then select Friedman from the analysis of
variance menu in the analysis section. Enter the number of groups as four.
For this example:
T2 = 3.192198 P = 0.0362 *
Grass 1 - Grass 2 P = 0.0149 *
Grass 1 - Grass 3 P = 0.0226 *
Grass 1 - Grass 4 P = 0.4834
Grass 2 - Grass 3 P = 0.8604
Grass 2 - Grass 4 P = 0.0717
Grass 3 - Grass 4 P = 0.1017
From the overall test statistic we can conclude that there is a statistically
significant tendency for at least one group to yield higher values than at
least one of the other groups. Considering the raw data and the contrast
results we see that grasses 2 and 3 are significantly preferred above grass 1
but that there is little to choose between 2 and 3.
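A minimal cross-check in Python (assuming scipy, which returns the
chi-square form of the statistic; the Iman and Davenport T2 follows from it
by a simple conversion):

  from scipy import stats
  from scipy.stats import f as f_dist

  g1 = [4, 4, 3, 3, 4, 2, 1, 2, 3.5, 4, 4, 3.5]
  g2 = [3, 2, 1.5, 1, 2, 2, 3, 4, 1, 1, 2, 1]
  g3 = [2, 3, 1.5, 2, 1, 2, 2, 1, 2, 3, 3, 2]
  g4 = [1, 1, 4, 4, 3, 4, 4, 3, 3.5, 2, 1, 3.5]

  n, k = 12, 4
  chi2, _ = stats.friedmanchisquare(g1, g2, g3, g4)
  t2 = (n - 1) * chi2 / (n * (k - 1) - chi2)         # Iman & Davenport T2
  print(t2, f_dist.sf(t2, k - 1, (n - 1) * (k - 1))) # about 3.19 and 0.036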
¬<p values>╪29175 ¬
¬<analysis of variance>╪158578 ¬
¬<reference list>╪310584 ¬
|Multiple Contrasts| and ANOVA
The multiple contrast or simultaneous inference situation arises when you want
to make pairwise comparisons between many groups after an analysis of variance.
When multiple comparisons are made you are in danger of type I error using t
tests alone, therefore, more conservative approaches are required. Arcus offers
you methods due to Scheffé, Newman-Keuls and gives Bonferroni's limitation with
t tests (ref 4, 13, 22).
With the Newman-Keuls method, means are first ordered in sequence then each
possible discrete comparison is made. The probability associated with the
resultant q values are then derived from the Studentized range.
For Scheffé's test all possible linear contrasts are also made automatically.
Please note that Scheffé's is the most conservative method of all.
In the presence of a control group some authors recommend Dunnett's method and
there are more powerful contrast methods for controls such as that due to the
late D. A. Williams. These are not presently offered by Arcus but you CAN use
one of the methods which are in the current version of Arcus; they will just be
a little more conservative.
I recommend the Newman-Keuls method for general use. It is the most soundly
justifiable approach for most multiple contrast situations. You will not find
it in many other stats packages because it is difficult to program, and for no
other reason (ref 4, 22).
This is a controversial area in statistics and you would be wise to seek the
advice of a statistician before you design your study. In general you should
design experiments so that you can avoid having to "dredge" groups of data for
differences, decide which contrasts you are interested in at the outset. If you
can identify contrasts at the design stage of an experiment then subsequent use
of t tests is justified provided the basic assumptions of the t test are met.
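The Newman-Keuls and Scheffé procedures take some effort to program by hand,
but Bonferroni's limitation is trivial to apply; a minimal sketch in Python
(assuming scipy; the samples are hypothetical, purely for illustration):

  from itertools import combinations
  from scipy import stats

  groups = {"A": [23, 25, 21, 26, 24],   # hypothetical samples
            "B": [28, 30, 27, 31, 29],
            "C": [24, 26, 25, 23, 27]}
  pairs = list(combinations(groups, 2))
  alpha = 0.05 / len(pairs)              # divide alpha by the number of tests
  for a, b in pairs:
      t, p = stats.ttest_ind(groups[a], groups[b])
      print(a, "vs", b, round(p, 4), "sig" if p < alpha else "ns")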
¬<analysis of variance>╪158578 ¬
|Survival Analysis|
¬<Kaplan-Meier>╪182964 ¬
¬<Simple life table>╪194318 ¬
¬<Log-rank and Wilcoxon>╪199215 ¬
¬<Wei-Lachin>╪206225 ¬
This section offers facilities for the description and comparison of survival
experience in different groups. Unlike other Arcus functions the survival
analysis section does not use separate variables for different groups. The
groups are indicated by a group variable which contains group identifiers, i.e.
for 2 groups you would have a column of 1's and 2's in the worksheet. Each
value in this column (variable) gives a group identity to its row with
respect to the time, death and censorship data in adjacent columns.
|Kaplan-Meier|
This provides the Kaplan-Meier product limit estimates of the survivor (S) and
cumulative hazard (H) functions. Results are displayed for one group at a time
and you have the option to save these results as worksheet variables. If you
choose to save results as worksheet variables then the results are extended to
include confidence intervals for the survivor and cumulative hazard functions.
The variance estimates are approximations based on Greenwood's formula, these
may differ slightly from results obtained using other packages. The confidence
interval for the survivor function is not a simple application of Greenwood's
variance approximation because this would give impossible results (< 0 or > 1)
at extremes of S. The confidence interval for S uses an asymptotic maximum
likelihood solution by the transformation recommended by Kalbfleisch and
Prentice (ref 25). You are also given the option to plot these functions.
Four different plots are given and certain distributions are indicated if
these plots display linearity (ref 24, 25). The plots and their associated
distributions are:
PLOT DISTRIBUTION INDICATED IF LINEAR
H vs Time Exponential, through the origin with slope lambda
ln(H) vs ln(Time) Weibull, intercept beta and slope ln(lambda)
Z(S) vs ln(Time) Log-normal
H/Time vs Time Linear hazard rate
DEFINITIONS:
Let survival time = time to event/failure (here = death)
S = survivor function
H = hazard function
S = (estimated probability of surviving day t for those alive at the start
    of day t) x (estimated % surviving up to day t) %
H = risk of death at time t
BEYOND ARCUS:
Arcus offers you the basic construction of survivor and hazard estimates with
their confidence intervals. If you want to go further and fit models to these
functions then you require specialist software. At this point most researchers
should seek statistical advice. You should aim to fit these models using a
maximum likelihood procedure. Beware, you might need to construct a novel
non-linear model for your data. The commonest model is exponential but Weibull,
log-normal, log-logistic and Gamma often appear.
If the hazard function is constant over time then a plot of the cumulative
hazard function vs time will be linear through the origin with slope lambda.
If this is true then you have the
useful relationship Probability(survival > t) = exp(-lambda * t). This eases
the calculation of relative risk from the ratio of hazard functions at time t
on two survival curves. When the hazard function depends on time then you can
usually calculate relative risk after fitting Cox's proportional hazards model.
This model assumes that for each group the hazard functions are proportional
at each time, it does not assume any particular distribution function for the
hazard function. Proportional hazards modelling can be very useful, however,
most researchers should seek statistical guidance with this.
SAS includes some good routines for modelling survival data but you might
require Genstat, GLIM or MLP for more exploratory work.
EXAMPLE (from Kalbfleisch & Prentice ref 25, p 14):
Death from vaginal cancer after exposure to the carcinogen DMBA was measured
in two groups of rats. Group 1 had a different DMBA pre-treatment régime to
group 2. The time from pre-treatment to death is recorded. If a rat was still
living at the end of the experiment or it had died from a different cause then
that time is considered "censored". A censored observation is given the value
0 in the death/censorship variable to indicate a "non-event".
Group 1: 143, 164, 188, 188, 190, 192, 206, 209, 213, 216, 220, 227, 230,
234, 246, 265, 304, 216*, 244*
Group 2: 142, 156, 163, 198, 205, 232, 232, 233, 233, 233, 233, 239, 240,
261, 280, 280, 296, 296, 323, 204*, 344*
* = censored data
To analyse these data in Arcus you must first prepare them in three worksheet
columns appropriately labelled:
Group Time Death/Censorship
2 142 1
1 143 1
2 156 1
2 163 1
1 164 1
1 188 1
1 188 1
1 190 1
1 192 1
2 198 1
2 204 0
2 205 1
1 206 1
1 209 1
1 213 1
1 216 0
1 216 1
1 220 1
1 227 1
1 230 1
2 232 1
2 232 1
2 233 1
2 233 1
2 233 1
2 233 1
1 234 1
2 239 1
2 240 1
1 244 0
1 246 1
2 261 1
1 265 1
2 280 1
2 280 1
2 296 1
2 296 1
1 304 1
2 323 1
2 344 0
Then select the Kaplan-Meier function from the survival analysis menu of the
analysis section. Select Y when you are asked whether or not you want to save
various statistics to the worksheet. Select a 95% confidence interval by
pressing enter when prompted with the confidence interval menu. Select Y when
you are prompted about displaying plots.
For Group 1:
Here are the product limit estimates of survival and hazard to the times
observed in the experiment:
Time At Risk Dead Censored S Var S H Var H
143 19 1 0 0.94737 0.00262 0.05407 0.00292
164 18 1 0 0.89474 0.00496 0.11123 0.00619
188 17 2 0 0.78947 0.00875 0.23639 0.01404
190 15 1 0 0.73684 0.01021 0.30538 0.0188
192 14 1 0 0.68421 0.01137 0.37949 0.02429
206 13 1 0 0.63158 0.01225 0.45953 0.0307
209 12 1 0 0.57895 0.01283 0.54654 0.03828
213 11 1 0 0.52632 0.01312 0.64185 0.04737
216 10 1 1 0.47368 0.01312 0.74721 0.05848
220 8 1 0 0.41447 0.01311 0.88075 0.07634
227 7 1 0 0.35526 0.01264 1.0349 0.10015
230 6 1 0 0.29605 0.0117 1.21722 0.13348
234 5 1 0 0.23684 0.01029 1.44036 0.18348
244 4 0 1 0.23684 0.01029 1.44036 0.18348
246 3 1 0 0.15789 0.00873 1.84583 0.35015
265 2 1 0 0.07895 0.0053 2.53897 0.85015
304 1 1 0 0 0 ∞ 0
And with 95% confidence interval for S...
Time At Risk Survivor (S) 95% LCI S 95% UCI S
143 19 .9473684 .6811868 .9924147
164 18 .8947369 .6407944 .9725854
188 17 .7894737 .5319126 .9152861
190 15 .7368422 .4789329 .8810194
192 14 .6842106 .4279407 .8439419
206 13 .631579 .3789929 .804409
209 12 .5789474 .3320811 .76264
213 11 .5263159 .2872013 .7187639
216 10 .4736843 .2443767 .6728407
220 8 .4144737 .1961606 .6211132
227 7 .3552632 .1519129 .5664639
230 6 .2960527 .1116839 .5087005
234 5 .2368421 7.577927E-02 .4474698
244 4 .2368421 7.577927E-02 .4474698
246 3 .1578947 3.143191E-02 .3735425
265 2 7.894737E-02 5.665417E-03 .2876329
304 1 0 0 0
Below is the classical "survival plot" showing how survival declines with time.
If you want a high resolution plot of this then feed the data saved to the
worksheet through the survival plot function of the pictorial statistics menu.
Survivor
1.00+
│B
│A B
│ BA
│ B .
0.75+ A B
│ A
│ A
│ A B
│ A
0.50+ A
│ A
│ A B
│ A B
│ B
0.25+ A B
│ A .
│ A B
│
│ A B
0.00+ A B .
/+────────-+────────-+────────-+────────-+────────-+────────-+
140 180 220 260 300 340 380
Times
The approximate linearity of the log cumulative hazard (H) vs log time plot
below suggests a Weibull distribution of survival times.
Log Hazard
1.70+
│
│
│ B .
│ A B
0.45+ A B
│ A . B
│ AA B
│ AA B
│ A
-0.80+ AA B
│ A
│ A
│ A B
│ B .
-2.05+ B
│ A
│ B
│
│A
-3.30+B
/+────────-+────────-+────────-+────────-+────────-+────────-+
4.95 5.10 5.25 5.40 5.55 5.70 5.85
Log Times
At this point you may want to run a formal hypothesis test to see if
there is any statistical evidence for two or more survival curves being
different. This can be achieved using sensitive parametric methods if you have
fitted a particular distribution curve to your data. More often you would use
the ¬Log-rank and Wilcoxon╪199215 ¬ tests which do not assume any particular
distribution of the survivor function.
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|Simple Life Table|
This function provides a simple life table which displays the survival
experience of a group of individuals or cohort; it is much like the table
originally proposed by Berkson and Gage (ref 4, 5, 24, 25). The confidence
interval for lx is not a simple application of the estimated variance. Instead
it uses a maximum likelihood solution from an asymptotic distribution via the
transformation of lx suggested by Kalbfleisch and Prentice (ref 25). This
treatment of lx avoids impossible values (i.e. >1 or <0).
DEFINITIONS:
INTERVAL For a full life table this is ages in single years.
For an abridged life table this is ages in groups.
For a Berkson and Gage survival table this is the survival times
in intervals.
DEATHS Number of individuals who die in the interval.
W'DRAWN Number of individuals withdrawn or lost to follow up in the
interval.
AT RISK Number of individuals alive at the start of the interval.
N'x Adjusted number at risk (half of withdrawals of current interval
subtracted).
q Probability that an individual who survived the last interval will
die in the current interval.
p Probability that an individual who survived the last interval will
survive the current interval.
lx Probability of an individual surviving beyond the current interval.
Proportion of survivors after the current interval.
Life table survival rate.
Var(lx) Estimated variance of lx.
X% LCI lx Lower x% confidence interval for lx.
X% UCI lx Upper x% confidence interval for lx.
EXAMPLE (from Armitage ref 4 p 425):
The following data represent the survival of 374 patients who had one type of
surgery for a particular malignancy:
Years since operation Died in this interval Lost to follow up
1 90 0
2 76 0
3 51 0
4 25 12
5 20 5
6 7 9
7 4 9
8 1 3
9 3 5
10 2 5
To analyse these data in Arcus you must first prepare them in three worksheet
columns appropriately labelled. Then select the simple life table from the
survival analysis menu of the analysis section. Enter the number at the start
as 374. Select a 95% confidence interval by pressing enter when prompted by the
confidence interval menu.
For this example:
Interval Deaths W'drawn At Risk N'x q p
0- 90 0 374 374 0.2406417 0.7593583
1- 76 0 284 284 0.2676056 0.7323943
2- 51 0 208 208 0.2451923 0.7548077
3- 25 12 157 151 0.1655629 0.8344371
4- 20 5 120 117.5 0.1702128 0.8297873
5- 7 9 95 90.5 0.07734807 0.9226519
6- 4 9 79 74.5 0.05369128 0.9463087
7- 1 3 66 64.5 0.01550388 0.9844961
8- 3 5 62 59.5 0.05042017 0.9495798
9- 2 5 54 51.5 0.03883495 0.9611651
10- - - 47 - - -
Interval p lx Var(lx) 95% LCI lx 95% UCI lx
0- 0.7593583 1 - - -
1- 0.7323943 0.7593583 0.00048859 0.7127129 0.7995125
2- 0.7548077 0.5561497 0.00066002 0.5042839 0.6048234
3- 0.8344371 0.4197861 0.00065125 0.3694556 0.4692234
4- 0.8297873 0.3502851 0.00061468 0.3020018 0.3988916
5- 0.9226519 0.2906621 0.00057073 0.2447156 0.33805
6- 0.9463087 0.26818 0.00055247 0.2232208 0.3150406
7- 0.9844961 0.253781 0.00054379 0.2093514 0.3004384
8- 0.9495798 0.2498464 0.0005423 0.2055291 0.2964883
9- 0.9611651 0.2372491 0.00053922 0.1932333 0.2839237
10- - 0.2280356 0.00053895 0.1932333 0.2839237
Thus we can conclude with 95% confidence that the true population survival rate
5 years after this operation lies somewhere between 24.5% and 33.8% for
patients who present with this malignancy.
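As a check on the arithmetic, the q, p, lx and Var(lx) columns above can be
reproduced with a few lines of Python. This is only a sketch of the standard
actuarial formulae (half of the withdrawals subtracted from the number at
risk, Greenwood's variance); it is not the Arcus source, and it does not
attempt the Kalbfleisch-Prentice transformed confidence limits:

deaths    = [90, 76, 51, 25, 20, 7, 4, 1, 3, 2]
withdrawn = [ 0,  0,  0, 12,  5, 9, 9, 3, 5, 5]
at_risk = 374
lx, gw = 1.0, 0.0                      # survival estimate and Greenwood sum
for i, (d, w) in enumerate(zip(deaths, withdrawn)):
    nx = at_risk - w / 2.0             # N'x: adjusted number at risk
    q = d / nx                         # probability of dying in the interval
    p = 1.0 - q                        # probability of surviving it
    print(f"{i}-  q={q:.7f}  p={p:.7f}  lx={lx:.7f}  Var(lx)={lx*lx*gw:.8f}")
    gw += q / (nx * p)                 # Greenwood accumulation for Var(lx)
    lx *= p
    at_risk -= d + w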
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|Log-Rank and Wilcoxon|
These are two methods for comparing two or more survival curves. These methods
do not make any assumptions about the distributions of the survival estimates
which comprise the curves. The null hypothesis that the risk of death is the
same in all groups is tested. Peto's log-rank test is generally the most
appropriate method but the modified Wilcoxon test is more sensitive when the
ratio of hazards is higher at early survival times than at late ones. An
optional variable, strata, allows you to sub-classify the groups specified
in the group identifier variable and to test the significance of this
sub-classification (ref 4, 24, 25).
EXAMPLE (from Armitage ref 4 p 431): The following data represent the survival
in days since entry to the trial of patients with diffuse histiocytic lymphoma.
Two different groups of patients, those with stage III and those with stage IV
disease, are compared.
Stage 3: 6, 19, 32, 42, 42, 43*, 94, 126*, 169*, 207, 211*, 227*, 253, 255*,
270*, 310*, 316*, 335*, 346*
Stage 4: 4, 6, 10, 11, 11, 11, 13, 17, 20, 20, 21, 22, 24, 24, 29, 30, 30,
31, 33, 34, 35, 39, 40, 41*, 43*, 45, 46, 50, 56, 61*, 61*, 63, 68,
82, 85, 88, 89, 90, 93, 104, 110, 134, 137, 160*, 169, 171, 173,
175, 184, 201, 222, 235*, 247*, 260*, 284*, 290*, 291*, 302*, 304*,
341*, 345*
* = censored data (patient still alive or died from an unrelated cause)
To analyse these data in Arcus you must first prepare them in three worksheet
columns as shown below:
group time censor
1 6 1
1 19 1
1 32 1
1 42 1
1 42 1
1 43 0
1 94 1
1 126 0
1 169 0
1 207 1
1 211 0
1 227 0
1 253 1
1 255 0
1 270 0
1 310 0
1 316 0
1 335 0
1 346 0
2 4 1
2 6 1
2 10 1
2 11 1
2 11 1
2 11 1
2 13 1
2 17 1
2 20 1
2 20 1
2 21 1
2 22 1
2 24 1
2 24 1
2 29 1
2 30 1
2 30 1
2 31 1
2 33 1
2 34 1
2 35 1
2 39 1
2 40 1
2 41 0
2 43 0
2 45 1
2 46 1
2 50 1
2 56 1
2 61 0
2 61 0
2 63 1
2 68 1
2 82 1
2 85 1
2 88 1
2 89 1
2 90 1
2 93 1
2 104 1
2 110 1
2 134 1
2 137 1
2 160 0
2 169 1
2 171 1
2 173 1
2 175 1
2 184 1
2 201 1
2 222 1
2 235 0
2 247 0
2 260 0
2 284 0
2 290 0
2 291 0
2 302 0
2 304 0
2 341 0
2 345 0
Next select the Log-rank and Wilcoxon function from the survival analysis
menu of the analysis section.
For this example:
relative death rate for stage 3 = .4794143
relative death rate for stage 4 = 1.232816
Log-rank test
Chi-square for equivalence of death rates = 6.70971 P = 0.0096 **
Generalised Wilcoxon test
Chi-square for equivalence of death rates = 3.936735 P = 0.0472 *
You can see that both tests have demonstrated a statistically significant
difference in survival experience between stage 3 and stage 4 patients in
this study.
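For those who want to see what the log-rank test is doing, the sketch below
(plain Python, not the Arcus algorithm) computes the observed and expected
deaths for each group in the usual log-rank way; the relative death rate
quoted above is O/E. Arcus uses the more exact variance formulae of ref 40,
so the simple chi-square sum((O-E)²/E) will be close to, but not identical
with, the value above:

def logrank_oe(times, events, groups):
    # Observed and expected deaths per group using log-rank expectations.
    labels = sorted(set(groups))
    O = {g: 0 for g in labels}
    E = {g: 0.0 for g in labels}
    for t in sorted(set(t for t, e in zip(times, events) if e == 1)):
        d = sum(1 for tt, e in zip(times, events) if tt == t and e == 1)
        n = sum(1 for tt in times if tt >= t)        # total still at risk
        for g in labels:
            ng = sum(1 for tt, gg in zip(times, groups)
                     if tt >= t and gg == g)         # at risk in group g
            E[g] += d * ng / n                       # expected deaths at t
    for tt, e, g in zip(times, events, groups):
        O[g] += e
    return O, E

# With times, events (1 = death) and groups read from the worksheet above:
# O, E = logrank_oe(times, events, groups)
# rates = {g: O[g] / E[g] for g in O}                # relative death rates
# chi2 = sum((O[g] - E[g]) ** 2 / E[g] for g in O)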
Stratified example: (from Peto et al. ref 40)
Group Identifier Trial Times Censorship (Strata, optional)
1 8 1 (event = death) 1 (renal impairment)
1 8 1 2 (no renal impairment)
2 13 1 1
2 18 1 1
2 23 1 1
1 52 1 1
1 63 1 1
1 63 1 1
2 70 1 2
2 70 1 2
2 180 1 2
2 195 1 2
2 210 1 2
1 220 1 2
1 365 0 (lost to f.u.) 2
2 632 1 2
2 700 1 2
1 852 0 (surviving) 2
2 1296 1 2
1 1296 0 2
1 1328 0 2
1 1460 0 2
1 1976 0 2
2 1990 0 2
2 2240 0 2
The table above shows you how to prepare data for a stratified log-rank test
in Arcus. This example is worked through in the second of two classic papers
by Richard Peto and colleagues (ref 39, 40). If you want to understand survival
analysis then I strongly advise you to read these two papers. Please note that
Arcus uses the more exact variance formulae mentioned in the statistical notes
section at the end of ref 40.
¬<p values>╪29175 ¬
¬<reference list>╪310584 ¬
|Wei-Lachin|
This provides a two sample distribution free analysis for the comparison of two
multivariate distributions of survival / time-to-event data which may be
incomplete / censored. The method uses the random censorship model to apply
generalisations of the log-rank test and the Gehan generalised Wilcoxon test.
(ref A21, 26). Arcus asks you for a group identifier variable which should be
a vector of 1's and 2's representing the two groups. You then identify n pairs
of time-to-event and censorship variables for the n repeat times which you have
specified. Censored data are coded as 0 and 1 represents uncensored data in
the censorship variable. Repeat times may represent separate factors or the
observation of the same factor repeated on n occasions. For example, time to
develop symptoms could be analysed for n different symptoms in a group of
patients treated with drug x and compared with a group of patients not treated
with drug x. For further details please refer to the excellent paper by
Robert Makuch et al. from which this Arcus function was developed (ref A21).
EXAMPLE (from Makuch ref A21): The following data represent the times in days
it took in vitro cultures of lymphocytes to reach a level of p24 antigen
expression. The cultures were taken from patients infected with HIV-1 who had
advanced AIDS or AIDS related complex. The idea was that patients whose
cultures took a short time to express p24 antigen had a greater load of HIV-1.
The two groups represented patients on two different treatments. The culture
was run for 30 days and specimens which remained negative or which became
contaminated were called censored (=0). The tests were run over four 30 day
periods:
Treatment Time 1 Cens 1 Time 2 Cens 2 Time 3 Cens 3 Time 4 Cens 4
Group
1 8 1 0 0 25 0 21 1
1 6 1 4 1 5 1 5 1
1 6 1 5 1 28 0 18 1
1 14 0 35 0 23 1 19 0
1 7 1 0 0 13 1 0 0
1 5 1 4 1 27 1 8 1
1 5 1 21 0 6 1 14 1
1 6 1 10 1 14 1 18 1
1 7 1 4 1 15 1 8 1
1 6 1 5 1 5 1 5 1
1 4 1 5 1 6 1 3 1
1 5 1 4 1 7 1 5 1
1 21 0 5 1 0 0 6 1
1 13 1 27 0 21 0 8 1
1 4 1 27 0 7 1 6 1
1 6 1 3 1 7 1 8 1
1 6 1 0 0 5 1 5 1
1 6 1 0 0 4 1 6 1
1 7 1 9 1 6 1 7 1
1 8 1 15 1 8 1 0 0
1 18 0 27 0 18 0 9 1
1 16 1 14 1 14 1 6 1
1 15 1 9 1 12 1 12 1
2 4 1 5 1 4 1 3 1
2 8 1 22 1 25 0 0 0
2 6 1 6 1 8 1 5 1
2 7 1 10 1 10 1 18 1
2 5 1 14 1 17 0 6 1
2 3 1 5 1 8 1 6 1
2 6 1 11 1 6 1 13 1
2 6 1 0 0 15 1 7 1
2 6 1 12 1 19 1 8 1
2 6 1 25 0 0 0 22 0
2 4 1 7 1 5 1 7 1
2 5 1 7 1 4 1 6 1
2 3 1 9 1 7 1 6 1
2 9 1 17 1 0 0 21 0
2 6 1 4 1 8 1 14 1
2 5 1 5 1 7 1 16 0
2 12 1 18 0 14 1 0 0
2 9 1 11 1 15 1 18 0
2 6 1 5 1 9 1 0 0
2 18 0 8 1 10 1 13 1
2 4 1 4 1 5 0 10 1
2 3 1 10 1 0 1 21 0
2 8 1 7 1 10 1 12 1
2 3 1 6 1 7 1 9 1
To analyse these data in Arcus you must first prepare them in 9 worksheet
columns as shown above. Then select the Wei-Lachin function from the survival
analysis menu of the analysis section. Enter number of repeat times as 4.
For this example:
Univariate generalised Wilcoxon tests:
repeat time = 1
chi-square = 3.588261 P = 0.0582
repeat time = 2
chi-square = .1071885 P = 0.7434
repeat time = 3
chi-square = .2164523 P = 0.6418
repeat time = 4
chi-square = 1.996144 P = 0.1577
Multivariate generalised Wilcoxon test:
chi squared omnibus statistic = 9.242916 P = 0.0553
stochastic ordering chi-square = 9.598206E-02 P = 0.7567
Univariate log-rank tests:
repeat time = 1
chi-square = 3.344057 P = 0.0674
repeat time = 2
chi-square = .5345362 P = 0.4647
repeat time = 3
chi-square = .9179572 P = 0.3380
repeat time = 4
chi-square = 2.675657 P = 0.1019
Multivariate log-rank test:
chi squared omnibus statistic = 9.52966 P = 0.0491 *
stochastic ordering chi-square = .4743826 P = 0.4910
Here the multivariate log-rank test has revealed a statistically significant
difference between the treatment groups which was not revealed by any of the
individual univariate tests. For more detailed discussion of each result
parameter please see Wei and Lachin's original paper (ref 26).
¬<p values>╪29175 ¬
¬<reference list>╪310584 ¬
|Instant Functions| (Non-Worksheet oriented analysis)
¬<Distributions>╪213522 ¬
¬<Chi-square tests>╪222593 ¬
¬<Exact tests>╪242962 ¬
¬<Proportions>╪262904 ¬
¬<Sample Size>╪256010 ¬
¬<Randomisation>╪252007 ¬
¬<Miscellaneous>╪269298 ¬
These functions are referred to as instant because they do not require columns
of numbers to have been prepared in advance using the Arcus worksheet. You are
prompted for the relevant data within the function.
Statistical Probability |Distributions|
This section deals with the commonly used statistical probability distributions.
Robust, reliable algorithms have been employed to provide a high level of
accuracy, thus most tail areas are given to six decimal places. For practical
purposes the p values given with hypothesis tests throughout Arcus are displayed
to four decimal places.
¬<Normal>╪218237 ¬
¬<Chi-square>╪218665 ¬
¬<Student's t>╪219206 ¬
¬<F (variance ratio)>╪219707 ¬
¬<Studentized range Q>╪220173 ¬
¬<Spearman's rho>╪221715 ¬
¬<Kendall's tau>╪222143 ¬
¬<binomial>╪220751 ¬
¬<Poisson>╪221217 ¬
PROBABILITY DISTRIBUTIONS
-------------------------
Probability exists as a concept to help us predict the chance of something
happening (an outcome) based on observations of this outcome in the past.
In mathematical language, this outcome is described in terms of a random
variable. The random variable can take on different values which represent
different outcomes, e.g. blood pressure. This sort of random variable can be
thought of in infinitely small units of measurement where the steps between
the units are so small that they become continuous; this is a continuous
random variable. The other kind of random variable is called discrete.
Discrete random variables take on discrete outcomes such as the number of
times an asthmatic patient has been admitted to hospital with an acute
exacerbation. If you consider an outcome measured in many different
individuals in a population then you are starting to build up a model of this
outcome within that population. If you then plot all of the values of this
outcome on a histogram you might find a particular shape emerging every time
you take a large random sample from this population. With a continuous random
variable you can draw a curve around the histogram because it is possible to
have values in between any that are measured. With a discrete variable,
however, there may only be a few possible outcomes so your histogram will have
wide bars with definite steps between them. This is like the difference
between a digital signal (steps) and an analogue signal (curves).
Now comes the all important linking concept, probability distribution. We have
discussed how the different values of an outcome can be plotted on a histogram
with some values occurring more frequently than others. Thus the commonly
occurring values have a higher probability of being observed when you take a
random sample of that population.
DEFINITION: A probability distribution of a random variable is a table, graph
or mathematical expression giving the probabilities with which the random
variable takes different values.
Putting numbers to this concept involves more thought about populations.
Think of a graph of probability (p) plotted against the value of outcome (x).
A probability distribution would include all possible values for x. The sum
of p for all possible values of x is defined as 1. For discrete variables
this is literally a simple summation but for continuous variables the number of
possible values of x is infinite so we use integration to estimate the area
under the curve. This area is of course 1 for the total curve. Now consider
one value of x. You can use the probability distribution for x to estimate the
chance of observing that x at random in the population. For discrete
distributions we do literally calculate p but for continuous distributions we
consider a partial area under the curve or probability density function which
represents the probability that x lies between 2 specified values.
Most of the time you will be dealing with outcomes which are values of a
statistic
calculated as a test of some hypothesis. The so called test statistic can
usually be compared with one of the standard probability distributions. The
p value derived from this test statistic is then used to accept or refute the
test hypothesis with an accepted level of certainty. This sort of result often
gives a false sense of security as it says nothing about the assumptions of
your test. The use of confidence intervals gives a more realistic
representation of a test result, but it most certainly does NOT rescue a test
used with invalid assumptions. Please read the help text regarding assumptions
when you are
using any of the hypothesis tests in Arcus.
Discrete distributions: eg Binomial, Poisson
Continuous distributions: eg Normal, Chi-square, Student's t, F
If you need more information about probability and sampling theory then please
consult one of the introductory or core texts listed in the reference section.
|Normal| (Gaussian)
The normal distribution is the most important continuous probability
distribution. It was first described by De Moivre in 1733 and subsequently by
the German mathematician C. F. Gauss (1777 - 1855). Arcus gives you the tail
areas and percentage points for this function. Please note that the upper and
lower tails are not simply 1.0 minus the other. (ref A3, A4)
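If you want to reproduce a tail area outside Arcus, the upper tail of the
standard normal distribution can be computed from the complementary error
function. A minimal Python sketch:

from math import erfc, sqrt

def upper_tail(z):
    # P(Z > z) for a standard normal variable
    return 0.5 * erfc(z / sqrt(2.0))

print(upper_tail(1.96))    # about 0.025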
¬<Distributions>╪213522 ¬
|Chi-square|
The chi-square statistic is related to the sum of squares of a number of
standard normal variables and is associated with a positively (right) skewed
distribution which approaches symmetry as the sample size increases. Arcus can
be used to calculate the probability associated with a chi-square random
variable with given degrees of freedom and to calculate the percentage points
of this distribution (ref A5). A reliable approach to the incomplete gamma
integral is used (ref A16).
¬<Distributions>╪213522 ¬
|Student's t|
t represents a family of distributions which are shaped by nu degrees of
freedom. When nu is infinite t becomes a normal distribution. This family of
distributions is associated with W. S. Gosset who, at the turn of the century,
published his work under the pseudonym Student. Arcus uses the relationship
between Student's t and Snedecor's f to calculate the tail areas and percentage
points of t distributions for given degrees of freedom.
¬<Distributions>╪213522 ¬
|F (variance ratio)|
Snedecor's f describes the distribution of variance estimates of two samples,
each from a normal distribution. The size of each sample is reflected in the
degrees of freedom nu1 and nu2. Arcus calculates tail areas and percentage
points for given numerator (nu1) and denominator (nu2) degrees of freedom.
Reliable approaches to the beta function are used in these calculations
(ref A7, A8, A9, A10).
¬<Distributions>╪213522 ¬
|Studentized Range Q|
The Studentized range, Q, is the range of means divided by the estimated
standard error for a given group of samples. This is often used in multiple
comparison / simultaneous inference methods which accompany analyses of
variance. Arcus calculates tail areas and percentage points for a given number
of samples and sample sizes. Please note that these calculations are highly
complex and will take longer than any of the other distribution functions
particularly with large numbers of samples (ref A11, A12).
¬<Distributions>╪213522 ¬
|Binomial|
The binomial distribution describes a random variable which is the number of
successes in n trials. There must be only two outcomes to the trial, success
or failure. Each of the n repetitions of this trial must also be completely
independent. Arcus calculates cumulative probabilities for (>=, <=, =) r
successes in n trials. Confidence intervals for binomial proportions are given
with the Arcus sign test.
¬<Distributions>╪213522 ¬
|Poisson|
The Poisson distribution represents the probabilities of r events occurring
independently and at random in certain defined circumstances with mean µ.
This approximates a binomial distribution when the number of trials is large
and the probability of success on each trial is small. Arcus calculates
cumulative probabilities that (<=, >=, =) r random events are contained in an
interval when the average number of such events per interval is µ.
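As an illustration of the calculation (a sketch, not the Arcus routine), the
cumulative probability of r or fewer events when the mean is µ is a simple
sum of Poisson terms:

from math import exp

def poisson_cdf(r, mu):
    # P(X <= r) for a Poisson variable with mean mu
    term = total = exp(-mu)              # the k = 0 term
    for k in range(1, r + 1):
        term *= mu / k                   # recurrence avoids large factorials
        total += term
    return total

print(poisson_cdf(3, 2.5))    # about 0.7576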
¬<Distributions>╪213522 ¬
|Spearman's Rho| / Hotelling-Pabst
Given a value for the Hotelling-Pabst test statistic (T) or Spearman's rho this
function calculates the probability of obtaining a value greater than or equal
to T. Upper tail probabilities are calculated using a recurrence method when
n < 7 and the Edgeworth series expansion when n >= 7. The maximum error for any
probability is 0.0004 (ref A13).
¬<Distributions>╪213522 ¬
|Kendall's Tau|
Given a value for the test statistic (S) associated with Kendall's tau this
function calculates the probability of obtaining a value greater than or equal
to S for a given sample size. Upper tail probabilities are calculated using a
recurrence method when n < 9 and an improved Edgeworth series expansion when
n >= 9 (ref A14). The two samples are assumed to have been ranked without ties.
¬<Distributions>╪213522 ¬
|Chi-square Tests|
¬<2 by 2>╪223812 ¬
¬<2 by k>╪230817 ¬
¬<r by c>╪227666 ¬
¬<Matched pairs (McNemar, Liddell)>╪233702 ¬
¬<Mantel-Haenszel>╪235832 ¬
¬<Woolf>╪239133 ¬
Chi-square tests compare observed and expected frequencies of individuals
grouped by different categories. Arcus applies the basic chi-square analysis
to a number of different contingency table designs. The larger the resultant
chi-square statistic (for given degrees of freedom) the more likely there is
to be a significant difference between observed and expected frequencies. A null
hypothesis that there is no difference between the populations from which you
quantify observed and expected frequencies is tested by comparing the calculated
chi-square statistic with percentage points of the chi-square distribution.
This is valid provided that the numbers are not too small; in general any
expected frequency should be greater than five.
|Haldane| correction
This is a method used to avoid error in the calculation of some of the chi-
square tests in Arcus. It involves adding 0.5 to all of the cells of a
contingency table if any of the cell expectations would cause a division by
zero error.
|2 by 2| contingency table chi-square test
The two by two or fourfold contingency table is commonly used to compare two
proportions. The rows represent two classifications of one variable (e.g.
infection/no infection) and the columns represent two classifications of another
variable (e.g. antiseptic wash/no antiseptic). These classifications must be
independent. Paired results (e.g. same group of individuals before and after
antiseptic wash) should be analysed using a test for ¬matched pairs╪233702 ¬.
Fisher's exact test should be used as an alternative to the fourfold chi-square
test if the total number is less than twenty or any of the expected frequencies
are less than five. In practical terms, however, there is little point in using
the fourfold chi-square test when Arcus provides you with a Fisher's exact test
which can cope with reasonably large numbers. In the fourfold chi-square test
you are advised to use the Yates' corrected value as this improves the
approximation of your discrete sample chi-square statistic to a continuous
chi-square distribution (ref 4).
The odds ratio of this 2 by 2 table is given and the associated approximate
confidence interval (CI) is calculated using two different methods. The CI
using the logit method for large samples is given first followed by the CI
using Cornfield's method (ref 9, 11). The latter is the most reliable method
but the logit method might be more acceptable if a convergent solution has not
been achieved with Cornfield's method.
EXAMPLE (from Armitage ref 4 p 126):
The following represent mortality data for two groups of patients receiving
different treatments, A and B.
Outcome
Dead Alive
Treatment / Exposure A 41 216
B 64 180
To analyse these data in Arcus you must select the 2 by 2 contingency table
from the chi-square sub-menu of the instant functions menu in the analysis
section. Select a 95% confidence interval by pressing the enter key when
prompted by the confidence interval menu. Enter the frequencies into the
contingency table on screen as shown above.
For this example:
Observed values and totals:
╔════════════════╤════════════════╤════════════════╗
║ 41 │ 216 │ 257 ║
╟────────────────┼────────────────┼────────────────╢
║ 64 │ 180 │ 244 ║
╠════════════════╪════════════════╪════════════════╣
║ 105 │ 396 │ 501 ║
╚════════════════╧════════════════╧════════════════╝
Expected values:
╔════════════════╤════════════════╗
║ 53.86227 │ 203.1377 ║
╟────────────────┼────────────────╢
║ 51.13773 │ 192.8623 ║
╚════════════════╧════════════════╝
Yates-corrected Chi² = 7.370595 P = 0.0066
Coefficient of contingency: V = -0.126198
Using Cornfield's Method for a 95% CI:
Odds ratio (after ¬Haldane╪223546 ¬ correction) = 0.536423
Lower limit: 0.335953
Upper limit: 0.847064
Here we can see a statistically significant relationship between treatment
and mortality. The strength of that relationship is reflected by the
coefficient of contingency. The odds ratio tells us that the odds in favour of
dying after treatment A are about half of the odds of dying after treatment B.
With 95% confidence we put the true population value for this ratio of odds
somewhere between 0.34 and 0.85. If you need to phrase the arguments with
odds ratios the other way around then just quote the reciprocals, i.e. here
we would say that the odds of dying after treatment B are 1.86 times greater
than after treatment A.
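To see where these figures come from, here is a short Python sketch (not
Arcus code) of the Yates-corrected chi-square and the large-sample logit
confidence interval; the Haldane correction adds 0.5 to each cell before the
odds ratio is formed. Cornfield's interval is iterative and is not attempted
here:

from math import exp, log, sqrt

a, b, c, d = 41, 216, 64, 180
n = a + b + c + d
# Yates-corrected chi-square for the fourfold table
chi2 = ((abs(a * d - b * c) - n / 2.0) ** 2 * n /
        ((a + b) * (c + d) * (a + c) * (b + d)))
# Odds ratio after Haldane correction, with the large-sample logit interval
a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
oddsr = a * d / (b * c)
se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
lo, hi = exp(log(oddsr) - 1.96 * se), exp(log(oddsr) + 1.96 * se)
print(chi2, oddsr, lo, hi)   # 7.3706, 0.5364 and logit limits near 0.35-0.83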
¬<p values>╪29175 ¬
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|R by C| contingency table chi-square test
The r by c chi-square test extends the chi-square method to any number of
independent categories expressed as r rows and c columns of a contingency
table. The overall test indicates the degree of independence between the
variables which make up the table. An analysis of trend indicates how much of
the difference between the mean scores for the columns can be accounted for by
linear trend. Armitage (ref 4) quotes an example where extent of grief of
mothers suffering a perinatal death, graded I to IV, is compared with the
degree of support received by these women. In this example the overall
statistic is non-significant but a significant trend is demonstrated. The
largest table for a display of individual results is 8 columns by 10 rows but
general results are given for larger tables, with the maximum table size being
limited only by your computer's memory. Observed values, expected values and
totals are given for the table when c <= 8 and r <= 10.
EXAMPLE (from Armitage ref 4 p 378):
The following data (as above) describe the state of grief of 66 mothers who
had suffered a perinatal death. The table relates this to the amount of
support
given to these women:
Support
Good Adequate Poor
Grief State I 17 9 8
II 6 5 1
III 3 5 4
IV 1 2 5
To analyse these data in Arcus you must select r by c from the chi-square test
menu of the instant functions menu in the analysis section. Press N when asked
about percentages. Choose a 95% confidence interval by pressing the enter key
when prompted by the confidence interval menu. Then select the number of rows
as 4 and the number of columns as 3. You then enter the above data as
directed by the screen.
For this example:
Observed 17 9 8 34
Expected 13.91 10.82 9.27
DChi² 0.69 0.31 0.17
Observed 6 5 1 12
Expected 4.91 3.82 3.27
DChi² 0.24 0.37 1.58
Observed 3 5 4 12
Expected 4.91 3.82 3.27
DChi² 0.74 0.37 0.16
Observed 1 2 5 8
Expected 3.27 2.55 2.18
DChi² 1.58 0.12 3.64
Totals: 27 21 18 66
TOTAL number of cells = 12
WARNING: 9 out of 12 cells have 1 <= EXPECTATION < 5
Overall chi-square = 9.9588 P = 0.1264
Chi-square for equality of mean scores = 5.784033 P = 0.0555
Chi-square for trend in mean scores = 5.746874 P = 0.0165 *
Chi-square for departures from trend = 0.037159 P = 0.8471
Coefficients of contingency:
Pearson's = 0.362088
Cramer's = 0.274673
Here we see that although the overall test was not significant we did show a
statistically significant trend in mean scores. This suggests that supporting
these mothers did help lessen their burden of grief.
¬<p values>╪29175 ¬
¬<reference list>╪310584 ¬
|2 by k| contingency table chi-square test
Several proportions can be compared using a two by k chi-square test. For
example, a village can be subdivided into k age groups and counts made of those
individuals with and those without a particular disease marker. From the
overall test you can see whether or not age has a significant effect on the
disease studied. Arcus also performs a test for linear trend across the k
groups. You can opt to enter your own scores for the trend test. For example,
if a variable was categorised as mild, moderate or severe you would want to
enter your own scores if the data were not presented in order (ref 4). You
could equally use the r by c chi-square test for these analyses; it just
has a different style of presentation and data input. If you need coefficients
of contingency then you should use the r by c chi-square function.
EXAMPLE (from Armitage ref 4 p 373):
The following data describe numbers of children with different sized palatine
tonsils and their carrier status for Strep. pyogenes.
Tonsils
Present but Enlarged Greatly
not enlarged enlarged
Carriers                 19          29          24
Non-carriers            497         560         269
To analyse these data in Arcus you must select 2 by k from the chi-square test
sub-menu of the instant functions menu in the analysis section. Then select
the middle option from the 2 by k chi-square test menu. Choose a 95% confidence
interval by pressing the enter key when prompted by the confidence interval
menu. Then select the number of rows as 3. You then enter the above data as
directed by the screen. Use carriers as successes and non-carriers as failures.
For this example:
Successes Failures Total Per cent
Observed 19 497 516 3.682171
Expected 26.57511 489.4249
Observed 29 560 589 4.923599
Expected 30.33476 558.6652
Observed 24 269 293 8.191126
Expected 15.09013 277.9099
Total 72 1326 1398 5.150215
Total Chi² = 7.884844 P = 0.0194 *
Chi² for linear trend = 7.192746 P = 0.0073 **
Remaining Chi² (non-linearity) = .6920977 P = 0.4055
Here the total chi-square test shows a statistically significant association
between the classifications, i.e. between tonsil size and Strep. pyogenes
carrier status. We have also shown a significant linear trend which enables
us to refine our conclusions to a suggestion that the proportion of Strep.
pyogenes carriers increases with tonsil size.
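The chi-square for linear trend can be checked with a few lines of Python.
This is a sketch of the standard trend formula with scores 1, 2, 3 (cf.
ref 4), not the Arcus source:

counts = [(19, 516), (29, 589), (24, 293)]    # (carriers, total) per column
scores = [1, 2, 3]
R = sum(r for r, _ in counts)                 # total carriers
N = sum(n for _, n in counts)                 # grand total
sx  = sum(n * x for (_, n), x in zip(counts, scores))
sxx = sum(n * x * x for (_, n), x in zip(counts, scores))
srx = sum(r * x for (r, _), x in zip(counts, scores))
pbar = R / N
u = srx - R * sx / N                          # observed minus expected scores
v = pbar * (1 - pbar) * (sxx - sx * sx / N)   # variance of that difference
print(u * u / v)                              # about 7.1927, cf. above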
¬<p values>╪29175 ¬
¬<reference list>╪310584 ¬
|Matched pairs (McNemar, Liddell)|
Paired proportions have traditionally been compared using McNemar's test but an
exact alternative is now available (after Liddell 1983). Arcus gives you both.
You enter your data in the 2 by 2 format with discordant cells at top right and
bottom left. The exact test gives you a two tailed probability and exact
confidence limits for the odds ratio. You should use the exact test for your
analysis; McNemar's test is included for interest only.
If you need the exact confidence interval for the difference between the pair
of proportions then please use the "paired proportions" function of the
proportions menu from the instant functions menu of the analysis section.
EXAMPLE (from Armitage ref 4 p 122):
The data below represent a comparison of two media for culturing Mycobacterium
tuberculosis. Fifty suspect sputum specimens were plated up on both media
and the following results were obtained:
Medium B
Growth No Growth
Medium A: Growth 20 12
No Growth 2 16
To analyse these data in Arcus you must select the matched pairs (McNemar,
Liddell) option from the chi-square menu of the instant functions menu in the
analysis section. Select a 95% confidence interval by pressing the enter key
when prompted by the confidence interval menu. Enter the frequencies into the
contingency table on screen as shown above.
For this example:
McNemar's test:
Yates' continuity corrected Chi² = 5.785714 P = 0.0162 *
After Liddell (1983):
Point estimate of relative risk (R') = 6
Exact 95% confidence interval = 1.335772 to 55.07571
F = 4 P (two tailed) = 0.0129 *
R' is significantly different from unity
Here we can conclude that the tubercle bacilli in the experiment grew
significantly better on medium A than on medium B. With 95% confidence we
can state that the chances of a positive culture are between 1.34 and 55.08
times greater on medium A than on medium B.
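The arithmetic here is easy to verify. Below is a minimal Python sketch (an
illustration, not Arcus code) of the Yates-corrected McNemar chi-square,
Liddell's point estimate R' = b/c, and the exact two tailed binomial p based
on the discordant pairs:

from math import comb

b, c = 12, 2                   # discordant cells (top right, bottom left)
chi2 = (abs(b - c) - 1) ** 2 / (b + c)    # McNemar with Yates' correction
rr = b / c                                # Liddell's point estimate R'
n = b + c
p2 = 2 * sum(comb(n, k) for k in range(max(b, c), n + 1)) / 2 ** n
print(chi2, rr, p2)            # 5.785714..., 6.0, 0.0129...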
¬<p values>╪29175 ¬
¬<reference list>╪310584 ¬
|Mantel-Haenszel| test for a 2 by 2 series
In case-control studies observed frequencies can often be represented by a
series of two by two tables. Each stratum of this series represents
observations taken at different times, different places or another system of
sub-grouping within one large study. The estimation of relative risk can
utilise the method of Mantel and Haenszel or that of Woolf. The Mantel-Haenszel
method is more robust when some of the strata contain small frequencies. Data
for these tests are entered as a series of two by two tables, each table
corresponding to a stratum of your investigation. Each table has the standard
(++), (+-), (-+), (--) format with (-+) and (--) for controls.
The Mantel-Haenszel pooled estimate of the odds ratio is given with test based
approximate confidence limits calculated by the method of Miettinen (ref 4).
The chi-square test statistic is given with associated probability of the odds
ratio being unity.
EXAMPLE (from Armitage ref 4 p 463):
The following data compare the smoking status of lung cancer patients with
controls. Ten different studies are combined in an attempt to improve the
overall estimate of relative risk. The matching of controls has been ignored
because there was not enough information about matching from each study to be
sure that the matching was the same in each study.
Lung cancer Controls
smoker non-smoker smoker non-smoker
83 3 72 14
90 3 227 43
129 7 81 19
412 32 299 131
1350 7 1296 61
60 3 106 27
459 18 534 81
499 19 462 56
451 39 1729 636
260 5 259 28
To analyse these data in Arcus you must select the Mantel-Haenszel function
from the chi-square sub-menu of the instant functions menu in the analysis
section. Select a 95% confidence interval by pressing the enter key when
prompted by the confidence interval menu. Enter the number of tables as 10.
Then enter each row of the table above as a separate 2 by 2 contingency table:
i.e. The first row is entered as:
Smkr Non
╔══════╤══════╗
Lung cancer ║ 83 │ 3 ║
╟──────┼──────╢
control ║ 72 │ 14 ║
╚══════╧══════╝
... this is then repeated for each of the ten rows.
For this example:
Mantel-Haenszel Chi-square = 292.3788 P < 0.0001 ***
Mantel-Haenszel pooled estimate of odds ratio = 4.681639
Approximate 95% CI = 3.922422 to 5.587809
Here we can say with 95% confidence that the true population odds in favour of
being a smoker were between 3.9 and 5.6 times greater in patients who had lung
cancer compared with controls. This estimate of the relative risk is obviously
highly significantly different from unity.
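A compact Python sketch (not the Arcus source) of the Mantel-Haenszel pooled
odds ratio and chi-square for the ten tables above; the chi-square here is
uncorrected, and Miettinen's test based limits are not attempted:

tables = [(83, 3, 72, 14), (90, 3, 227, 43), (129, 7, 81, 19),
          (412, 32, 299, 131), (1350, 7, 1296, 61), (60, 3, 106, 27),
          (459, 18, 534, 81), (499, 19, 462, 56), (451, 39, 1729, 636),
          (260, 5, 259, 28)]
num = den = sa = se = sv = 0.0
for a, b, c, d in tables:
    n = a + b + c + d
    num += a * d / n                     # pooled odds ratio numerator
    den += b * c / n                     # pooled odds ratio denominator
    r1, r2, c1, c2 = a + b, c + d, a + c, b + d
    sa += a                              # observed count in the first cell
    se += r1 * c1 / n                    # its expectation
    sv += r1 * r2 * c1 * c2 / (n * n * (n - 1))   # hypergeometric variance
print(num / den)                         # about 4.6816, cf. above
print((sa - se) ** 2 / sv)               # chi-square, about 292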
¬<p values>╪29175 ¬
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|Woolf| statistics for 2 by 2 tables & series
In case-control studies observed frequencies can often be represented by a
series of two by two tables. Each stratum of this series represents
observations taken at different times, different places or another system of
sub-grouping within one large study. The estimation of relative risk can
utilise the method of Mantel and Haenszel or that of Woolf. The ¬Mantel-Haenszel╪235832 ¬
method is more robust when some of the strata contain small frequencies. Data
for these tests are entered as a series of two by two tables, each table
corresponding to a stratum of your investigation. Each table has the standard
(++), (+-), (-+), (--) format with (-+) and (--) for controls.
With the Woolf method results for an individual 2 by 2 table are displayed
after you have entered that table; please remember this when entering a
large series.
When all tables have been entered the combined statistics (¬Haldane╪223546 ¬ corrected),
including chi-square for heterogeneity, are given.
EXAMPLE (from Armitage ref 4 p 463):
The following data compare the smoking status of lung cancer patients with
controls. Ten different studies are combined in an attempt to improve the
overall estimate of relative risk. The matching of controls has been ignored
because there was not enough information about matching from each study to be
sure that the matching was the same in each study.
Lung cancer Controls
smoker non-smoker smoker non-smoker
83 3 72 14
90 3 227 43
129 7 81 19
412 32 299 131
1350 7 1296 61
60 3 106 27
459 18 534 81
499 19 462 56
451 39 1729 636
260 5 259 28
To analyse these data in Arcus you must select the Woolf function from the
chi-square sub-menu of the instant functions menu in the analysis section.
Select a 95% confidence interval by pressing the enter key when prompted by
the confidence interval menu. Enter the number of tables as 10. Then enter
each row of the table above as a separate 2 by 2 contingency table:
i.e. The first row is entered as:
Smkr Non
╔══════╤══════╗
Lung cancer ║ 83 │ 3 ║
╟──────┼──────╢
control ║ 72 │ 14 ║
╚══════╧══════╝
... this is then repeated for each of the ten rows.
For this example:
Statistics from combined values with Haldane correction:
Odds ratio = 4.510211
Approximate 95% CI = 3.733489 to 5.448524
Chi² for E(LOR) = 0 is 254.0865 P < 0.0001 ***
Chi² for Heterogeneity = 6.532662 P = 0.6856
Here we can say that there was no convincing evidence of heterogeneity between
the separate estimates of relative risk from each of the different studies.
The pooled estimate suggested with 95% confidence that the true population
odds for being a smoker were between 3.7 and 5.4 times greater in lung cancer
patients compared with controls. The result using the Mantel-Haenszel method
gave 3.9 to 5.6; the difference is partly accounted for by the Haldane
correction. I would, however, advise you to keep to the Mantel-Haenszel method
for general use, as it is more robust. I have included Woolf's method for
those who want to go further with the inter-table statistics.
¬<p values>╪29175 ¬
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|Exact Tests|
¬<Fisher's exact test>╪243294 ¬
¬<Matched pairs (McNemar, Liddell)>╪233702 ¬
¬<Exact confidence limits for 2 by 2 odds>╪247832 ¬
¬<Sign test>╪249896 ¬
Various exact treatments of two by two tables are given in this section.
Permutational probabilities and exact confidence limits are provided.
|Fisher's Exact Test|
This exact treatment of the fourfold table should be used instead of the
chi-square test when any of the expected frequencies are less than five. In
practical terms, however, there is little point in using the fourfold
chi-square test when Arcus provides you with a Fisher's exact test which can
cope with reasonably large numbers. Arcus uses the definition of a two tailed
p value described by N. T. J. Bailey (ref 27). Finney recommends doubling the
one tailed value and controversy remains. Arcus calculates the conventional
exact test until the numbers are so large that the intermediate steps would
cause an overflow error; at this point the hypergeometric distribution is
utilised. The data entry is identical to the procedure for the chi-square
2 by 2 table and, indeed, results for a chi-square test are given with
Fisher's exact test results. The rearranged table is displayed with the
expectation of the first cell. The chi-square test results are included for
educational purposes only; you should make your inferences from the Fisher's
p values.
EXAMPLE (from Armitage ref 4 p 130):
The following data compare malocclusion of teeth with method of feeding infants.
Normal teeth Malocclusion
Breast fed 4 16
Bottle fed 1 21
To analyse these data in Arcus you must select the Fisher's exact test function
from the exact tests sub-menu of the instant functions menu in the analysis
section. Enter the frequencies into the contingency table on screen as shown
above.
For this example:
Rearranged table:
╔════════════════╤════════════════╤════════════════╗
║ 4 │ 1 │ 5 ║
╟────────────────┼────────────────┼────────────────╢
║ 16 │ 21 │ 37 ║
╠════════════════╪════════════════╪════════════════╣
║ 20 │ 22 │ 42 ║
╚════════════════╧════════════════╧════════════════╝
Expectation of A = 2.380952
1-tailed probability (Upper tail) = 0.143527 (Doubled = 0.287054)
2-tailed probability (by summation) = 0.174484
Here we have to accept the null hypothesis that there is no association between
these two classifications, i.e. between feeding method and malocclusion.
¬<p values>╪29175 ¬
¬<reference list>╪310584 ¬
|Expanded Fisher-Irwin test|
This allows you to see a conventional Fisher's exact test in more detail.
The complete conditional distribution for the observed marginal totals
is displayed. Arcus utilises double precision floating point arithmetic
for the exact tests (ref 27).
EXAMPLE (from Armitage ref 4 p 130):
The following data compare malocclusion of teeth with type of feeding received
by infants.
Normal teeth Malocclusion
Breast fed 4 16
Bottle fed 1 21
To analyse these data in Arcus you must select the Fisher's exact test function
from the exact tests sub-menu of the instant functions menu in the analysis
section. Enter the frequencies into the contingency table on screen as shown
above.
For this example:
Rearranged table:
╔════════════════╤════════════════╤════════════════╗
║ 4 │ 1 │ 5 ║
╟────────────────┼────────────────┼────────────────╢
║ 16 │ 21 │ 37 ║
╠════════════════╪════════════════╪════════════════╣
║ 20 │ 22 │ 42 ║
╚════════════════╧════════════════╧════════════════╝
Expectation of A = 2.380952
A Lower Tail Individual P Upper Tail
0 0.030956848030019 0.030956848030019 1.000000000000000
1 0.202939337085679 0.171982489055660 0.969043151969981
2 0.546904315196998 0.343964978111320 0.797060662914321
3 0.856472795497186 0.309568480300188 0.453095684803002
4 0.981774323237738 0.125301527740552 0.143527204502814
5 1.000000000000000 0.018225676762262 0.018225676762262
1-sided probability (Upper tail) = 0.1435272045 (Doubled = 0.2870544090)
2-sided probability (by summation)= 0.1744840525
Here we have to accept the null hypothesis that there is no association between
these two classifications, i.e. between feeding mode and malocclusion.
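The whole conditional distribution above can be reproduced directly from
hypergeometric point probabilities. A minimal Python sketch (not Arcus code;
Python's arbitrary precision integers sidestep the overflow problem mentioned
earlier):

from math import comb

a, b, c, d = 4, 1, 16, 21                  # the rearranged table above
r1, r2, c1 = a + b, c + d, a + c
n = a + b + c + d
def prob(k):                               # P(first cell = k), hypergeometric
    return comb(r1, k) * comb(r2, c1 - k) / comb(n, c1)
ps = [prob(k) for k in range(min(r1, c1) + 1)]
upper = sum(ps[a:])                        # one tailed (upper) probability
two = sum(p for p in ps if p <= ps[a] + 1e-12)   # two tailed by summation
print(upper, two)                          # 0.143527..., 0.174484...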
¬<p values>╪29175 ¬
¬<reference list>╪310584 ¬
|Exact Confidence Limits for 2 by 2 Odds|
Gart's method is used here to construct exact confidence limits for the odds
ratio of a fourfold table (ref A15). The default selections are 95, 99 and 90
per cent two tailed values but you may enter individual tail areas. Thus, for
a one tailed 95% confidence limit you would enter a lower tail area of 0 and
an upper tail area of 5. These exact confidence limits complement Fisher's
exact test of independence in a fourfold table. Please note that this
iterative calculation will take a long time with large numbers.
EXAMPLE (from Thomas ref A15):
The following data look at the criminal convictions of twins in an attempt to
investigate the heritability of criminality.
Convicted Not-Convicted
Dizygotic 2 15
Monozygotic 10 3
To analyse these data in Arcus you must select exact confidence limits for
2 by 2 odds from the exact tests sub-menu. To select a 95% two tailed
confidence interval just press enter when you are presented with the confidence
interval menu.
For this example:
Rearranged table:
╔════════════════╤════════════════╤════════════════╗
║ 15 │ 2 │ 17 ║
╟────────────────┼────────────────┼────────────────╢
║ 3 │ 10 │ 13 ║
╠════════════════╪════════════════╪════════════════╣
║ 18 │ 12 │ 30 ║
╚════════════════╧════════════════╧════════════════╝
Fisher-Irwin p (1 sided) = 0.000465 Doubled = 0.00093
Confidence limits with 2.5% lower tail area and 2.5% upper tail area
{two tailed}
Observed odds ratio = 25
Confidence limits = 301.4666 and 2.753266
Reciprocal = 0.04
Confidence limits = 0.003317 and 0.363205
Here we can say with 95% confidence that the odds of being a criminal convict
are between 2.75 and 301.5 times greater for identical than for non-identical
twins.
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|Sign test|
In a sample of size n, if r individuals show a change in one particular
direction then the sign test can be used to assess the significance of this
change. Arcus gives you one and two sided cumulative probabilities from a
binomial distribution with a projected proportion of 0.5 for the null
hypothesis. An appropriate normal approximation is used with large numbers.
You are also given an exact confidence interval for the proportion r/n
(ref 5,6). If you need a test where the projected proportion for the null
hypothesis is not 0.5 then you should use the ¬single proportion╪263180 ¬ function
listed in the proportions sub-menu of the Arcus instant functions menu.
EXAMPLE (from Altman ref 5 p 186):
Out of a group of 11 women investigated 9 were found to have a food energy
intake below the daily average and 2 above. We want to quantify the impact
of 9 out of 11, i.e. how much evidence have we got that these women are
different from the norm?
To analyse these data in Arcus you must select the sign test from the instant
functions menu of the analysis section. To select a 95% two tailed confidence
interval just press enter when you are presented with the confidence interval
menu.
For this example:
For 11 pairs with 9 on one side.
Cumulative probability (2-sided) = 0.06543
(1-sided) = 0.032715 *
Exact 95% Confidence limits for the Proportion:
Lower Limit = 0.482248
Proportion = 0.818182
Upper Limit = 0.977122
If we were confident that this group could only realistically be expected to
have a lower caloric intake then we could make inference from the one tailed
p value. We do not, however, have this evidence, so we must use the two tailed
p value and retain the null hypothesis that the true proportion is 0.5. We
can say with 95% confidence that the true population value of the proportion
lies somewhere between 0.48 and 0.98. The most sensible response to these
results would be to go back and collect more data.
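The binomial arithmetic behind this result is shown below in a short Python
sketch (an illustration, not Arcus code); with a null proportion of 0.5 the
distribution is symmetric, so the two tailed value is twice the one tailed:

from math import comb

n, r = 11, 9
k = max(r, n - r)
one = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n   # one tailed area
print(one, 2 * one)    # 0.032715, 0.06543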
¬<p values>╪29175 ¬
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|Randomisation| Functions
This section employs a well tried and widely accepted random number generator
to randomise series of numbers for given allocation designs. The results can
be used in the design of randomised studies. Please note that the random
number generator is reseeded each time it is used and you have virtually no
chance of using the same (pseudo)random number series for different
randomisations. For more information on the random number generator used here
please see "¬random numbers╪254271 ¬".
a) You can randomise a series of integers for which you define the
beginning and end points of the series. For example, randomising
numbers from 6 to 10 might give 8 6 9 10 7; this is like shuffling
5 cards labelled 6 to 10.
b) Random allocation of cases and controls for paired case-control
studies. For example, you might want to randomise 50 patients into
treatment (case) and placebo (control) groups for a pilot study of a
new drug. This would give 50 pairs of CASE - CONTROL or CONTROL - CASE.
If this was a randomised crossover study then you would give drug first
if the order was CASE - CONTROL and you would give placebo first if the
order was CONTROL - CASE.
c) Random allocation of subjects to case or control groups for unpaired
case-control studies. For example, you might want to look at the effect
of a new treatment. For a randomised controlled trial you might
randomly allocate some patients for this new treatment and compare them
with similar patients who did not receive this treatment. For 24
patients in two groups of 12 you would enter 24 into this section of
Arcus randomisation. This would give you two groups of 12 e.g.:
CASES CONTROLS
2 1
5 3
6 4
7 8
9 11
10 12
13 14
15 16
19 17
20 18
21 22
24 23
Here the first patient would be allocated to the control group and
the second to the treatment group etc.
|Random Numbers|
There is much fear of computer generated random numbers because of some bad
random number generators which have cropped up over the years. This is not a
problem in Arcus Pro-Stat because it uses well tried and tested methods.
If you want to get down to basics you might ask: what is random? A lecture
theatre filled with Mathematicians, Philosophers and Physicists
would love to debate this; enough said. What we can do is look for evidence
of non-randomness such as repeated patterns. Various methods have been
employed to look for non-randomness from "random" number generators since they
began to emerge around 35 years ago. Several "quick and dirty" random number
generators have become widely used because they are supplied with computer
language compilers. These generators often use over simple methods which
produce sequences of numbers with repeating patterns. This is unacceptable for
statistical use.
Arcus Pro-Stat uses the widely accepted Park & Miller "minimal" method extended
with a Bays-Durham shuffle. This is well described by Press et al. (ref 33).
Most random number generators require a seed. If the generator is given the
same seed each time it is called then it will produce the same series of
numbers. This is not acceptable for many purposes, therefore Arcus seeds the
random number generator with a number taken from your computer's
clock. This number is the number of hundredths of a second which have elapsed
since midnight. You will therefore understand why it is very difficult to
recall the same "random" sequence from Arcus when you ask Arcus to seed the
generator for you. You can also choose to enter your own seed.
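For the curious, the Park & Miller "minimal standard" generator is itself
only a few lines. The Python sketch below shows the core recurrence
seed = 16807 * seed mod (2^31 - 1); the Bays-Durham shuffle described by
Press et al. (ref 33) is omitted here for brevity:

def park_miller(seed):
    # Park & Miller minimal standard generator (no Bays-Durham shuffle)
    m = 2 ** 31 - 1                    # 2147483647, a Mersenne prime
    while True:
        seed = (16807 * seed) % m
        yield seed / m                 # uniform deviate in (0, 1)

gen = park_miller(12345)               # seed, e.g. hundredths since midnight
print([next(gen) for _ in range(3)])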
|Sample Size| Estimations
¬<for paired t test>╪257921 ¬
¬<for unpaired t test>╪258472 ¬
¬<for independent case-control>╪259089 ¬
¬<for matched case-control>╪260701 ¬
¬<for independent prospective>╪259879 ¬
¬<for paired prospective>╪261568 ¬
¬<for population surveys>╪262306 ¬
At the design stage of an investigation one must try to minimise the probability
of failing to detect a real effect, i.e. type II error (false negative).
Minimum sample sizes necessary to avoid given levels of type II error are
calculated by Arcus for population surveys, for the comparison of proportions
and for the comparison of means.
Type II error is indicated in reverse by the power of a study, thus power is the
probability of detecting a true effect. You are asked to select a power level
for your study along with the two tailed significance level which you intend to
use in subsequent analysis. The latter considers type I error, the probability
of incorrectly rejecting the null hypothesis (false positive).
Minimum sample sizes are estimated for the comparison of means using Student t
tests, the comparison of proportions and for population surveys. Provision is
made for paired and unpaired designs in case-control studies or independent
group studies. All of these calculations require you to enter a value for power
(the probability of detecting a true effect) and alpha (the probability of
detecting a false effect); all calculations consider two tailed investigation
(ref 4, 8, 11, 30, 31). Other information required depends upon the type of
study being planned; each required parameter is described in the help screen of
the relevant menu selection. I must emphasise the point that good design lies
at the heart of good research and for important studies statistical advice
should be sought at the planning stage!
¬<reference list>╪310584 ¬
Sample Size |for Paired t Test|
This function gives you the minimum number of pairs of subjects needed to detect
a true difference DELTA in population means with power POWER and two sided
type I error probability ALPHA (ref 30, 31).
INFORMATION REQUIRED:
POWER - Probability of detecting a true effect.
ALPHA - Probability of detecting a false effect (two sided).
DELTA - Difference in population means.
SD - Estimated standard deviation of paired response differences.
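As an illustration, the usual normal-approximation formula for the number of
pairs is sketched below in Python; Arcus may refine such calculations (e.g.
with the t distribution), so treat this only as an approximate check:

import math
from statistics import NormalDist

def pairs_needed(power, alpha, delta, sd):
    # approximate pairs for a paired t test via the normal approximation
    z = NormalDist().inv_cdf
    n = ((z(1 - alpha / 2) + z(power)) * sd / delta) ** 2
    return math.ceil(n)

print(pairs_needed(power=0.9, alpha=0.05, delta=0.5, sd=1.0))   # about 43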
¬<reference list>╪310584 ¬
¬<sample size>╪256010 ¬
Sample Size |for Unpaired t Test|
This function gives you the minimum number of experimental subjects needed to
detect a true difference DELTA in population means with power POWER and two
sided type I error probability ALPHA (ref 30, 31).
INFORMATION REQUIRED:
POWER - Probability of detecting a true effect.
ALPHA - Probability of detecting a false effect (two sided).
DELTA - Difference in population means.
SD - Estimated standard deviation for within group differences.
M - Number of control subjects per experimental subject.
¬<reference list>╪310584 ¬
¬<sample size>╪256010 ¬
Sample Size |for Independent Case-Control| studies
This function gives the minimum number of case subjects required to detect a
real odds ratio or case exposure rate with power POWER and two sided type I
error probability ALPHA. This sample size is also given as a continuity
corrected value intended for use with corrected chi-square and Fisher's exact
tests (ref 10, 30).
POWER - Probability of detecting a real effect.
ALPHA - Probability of detecting a false effect (two sided).
P0 - Probability of exposure in controls.
(P1 - Probability of exposure in case subjects.) *Input P1 or OR.
(OR - Odds ratio of exposures between cases and controls.)
M - Number of control subjects per case subject.
¬<reference list>╪310584 ¬
¬<sample size>╪256010 ¬
Sample Size |for Independent Prospective| studies
This function gives the minimum number of case subjects required to detect a
true relative risk or experimental event rate with power POWER and two sided
type I error probability ALPHA. This sample size also given as a continuity
corrected value intended for use with corrected chi-square and Fisher's exact
tests (ref 8, 10, 30).
POWER - Probability of detecting a real effect.
ALPHA - Probability of detecting a false effect (two sided).
P0 - Probability of event in controls.
(P1 - Probability of event in experimental subjects) *Input P1 or RR.
(RR - Relative risk of events between experimental subjects and controls.)
M - Number of control subjects per experimental subject.
¬<reference list>╪310584 ¬
¬<sample size>╪256010 ¬
Sample Size |for Matched Case-Control| studies
This function gives you the minimum sample size necessary to detect a true
odds ratio OR with power POWER and a two sided type I error probability ALPHA.
If you are using more than one control per case then this function also provides
the reduction in sample size relative to a paired study that you can obtain
using your number of controls per case (ref 10, 30).
INFORMATION REQUIRED:
POWER - Probability of detecting a real effect.
ALPHA - Probability of detecting a false effect (two sided).
R - Correlation coefficient for exposure between matched
cases and controls.
P0 - Probability of exposure in the control group.
M     - Number of control subjects matched to each case subject.
OR - Odds ratio.
¬<reference list>╪310584 ¬
¬<sample size>╪256010 ¬
Sample Size |for Paired Prospective| studies
This function gives you the minimum number of subject pairs that you require
to detect a true relative risk RR with power POWER and two sided type I error
probability ALPHA (ref 10, 30).
INFORMATION REQUIRED:
POWER - Probability of detecting a real effect.
ALPHA - Probability of detecting a false effect (two sided).
R - Correlation coefficient for failure between paired subjects.
***Next input is either P0 and RR or P0 and P1 (when RR=P1/P0).***
P0 - Event rate in the control group.
*(P1 - Event rate in experimental group.)
*(RR - Risk of failure of experimental subjects relative to controls.)
¬<reference list>╪310584 ¬
¬<sample size>╪256010 ¬
Sample Size |for Population Surveys|
This function gives you the minimum number of subjects that you require for a
survey of a population to estimate, to within a stated difference, the
proportion of individuals in that population displaying a particular factor
(ref 10).
INFORMATION REQUIRED:
Confidence level (i.e. 1-ALPHA)
(ALPHA - Probability of detecting a false effect (two sided).)
Population size
Proportion (as %) of the population displaying a particular factor.
A difference (as %) in that proportion you want to be able to detect.
¬<reference list>╪310584 ¬
¬<sample size>╪256010 ¬
|Proportions|
¬<Single proportion>╪263180 ¬
¬<Paired proportions>╪266823 ¬
¬<Unpaired proportions>╪265024 ¬
This section constructs confidence limits and probabilities for various
presentations of proportions. Exact tests are employed wherever possible.
|Single Proportion|
This function gives you the exact and approximate confidence interval for a
single proportion. There is also an hypothesis test for the proportion in
comparison with the expected proportion under the null hypothesis. You enter
this expected proportion when prompted for the probability of success on each
trial. This test uses the relevant binomial distribution. For example, when
comparing two preparations of a drug, if 65 out of 100 patients preferred
preparation A then the significance of this majority could be expressed by the
hypothesis test and described by the confidence interval (ref 4, 11).
EXAMPLE (from Armitage ref 4 p 116):
In a trial of two analgesics, X and Y, 100 patients tried each drug for a week.
The trial order was randomised. 65 out of 100 preferred drug Y.
To analyse these data in Arcus you must select single proportion from the
proportions sub-menu of the instant functions menu in the analysis section.
To select a 95% confidence interval just press enter when you are presented
with the confidence interval menu. Enter n as 100 and r as 65. Enter the
binomial test proportion as 0.5; this is because you would expect 50% of an
infinite number of patients to prefer drug Y if there was no difference between
X and Y.
For this example:
Proportion = 0.65
Exact 95% Confidence Limits:
Lower Limit = 0.548151
Upper Limit = 0.742706
Using null hypothesis that the population proportion equals 0.5:
Binomial two tailed P = 0.0035 **
Here we can conclude that the proportion was statistically significantly
different from 0.5. With 95% confidence we can state that the true population
value for the proportion lies somewhere between 0.55 and 0.74.
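If you wish to verify these figures outside Arcus then the following Python
sketch (assuming the SciPy library, which is not part of Arcus) reproduces the
exact interval and the binomial test:
  # A sketch reproducing the worked example with an exact binomial
  # test and Clopper-Pearson interval.
  from scipy.stats import binomtest

  res = binomtest(k=65, n=100, p=0.5)           # 65 of 100 preferred drug Y
  ci = res.proportion_ci(confidence_level=0.95, method="exact")
  print(65 / 100)                               # proportion = 0.65
  print(ci.low, ci.high)                        # approx. 0.5482 to 0.7427
  print(res.pvalue)                             # approx. 0.0035, two tailed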
¬<p values>╪29175 ¬
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|Unpaired Proportions|
Two independent proportions may be compared using this function. It is assumed
that your data have been observed from random samples of the two independent
populations. For example, the proportion of patients surviving a particular
surgical emergency could be compared for surgical and non-surgical management
protocols. An hypothesis test for the equality of these proportions is given
along with a confidence interval for the difference between the proportions.
A normal approximation is used for both of these methods, so you should avoid
small numbers (ref 4).
EXAMPLE (from Armitage ref 4 p 124):
Two methods of treatment, A and B, for a particular disease were investigated.
Out of 257 patients treated with method A 41 died and out of 244 patients
treated with method B 64 died. We want to compare these fatality rates.
To analyse these data in Arcus you must select unpaired proportions from the
proportions sub-menu of the instant functions menu in the analysis section.
To select a 95% confidence interval just press enter when you are presented
with the confidence interval menu. Enter n1 as 257, r1 as 41, n2 as 244 and
r2 as 64.
For this example:
Proportion 1 = 0.159533
Proportion 2 = 0.262295
95% confidence interval for the difference = -0.173829 to -0.031695
Normal deviate (Z) = -2.824689
Two tailed P = 0.0047 **
One tailed P = 0.0024 **
Here we can conclude that the difference between these two proportions is
statistically significantly different from zero. With 95% confidence we can
state that the true population fatality rate with treatment B is between 0.03
and 0.17 greater than with treatment A.
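The calculation behind this output can be checked with a short Python sketch;
this illustrates the usual normal approximation (pooled variance for the test,
unpooled for the interval) and is not the Arcus code itself:
  # A sketch of the normal approximation used in the worked example.
  from math import sqrt

  n1, r1, n2, r2 = 257, 41, 244, 64
  p1, p2 = r1 / n1, r2 / n2
  diff = p1 - p2                                # approx. -0.1028

  p = (r1 + r2) / (n1 + n2)                     # pooled proportion
  z = diff / sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
  print(z)                                      # approx. -2.8247

  se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  print(diff - 1.96 * se, diff + 1.96 * se)     # approx. -0.1738 to -0.0317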
¬<p values>╪29175 ¬
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|Paired Proportions|
Two proportions may be paired by sharing a common feature. For example, when
comparing two culture media a sputum sample from one patient is plated onto both
culture media; this is the "pairing". The procedure is then repeated for a
number of patients to allow proportions to be compared. Arcus gives you an
hypothesis test for the equality of these proportions and a confidence interval
for the difference between them. Exact methods are used throughout (ref 4, 20).
The two tailed p value from the hypothesis test equates with the exact test for
a paired fourfold table (Liddell) which has been presented above. With large
numbers an appropriate normal approximation is used in the hypothesis test.
EXAMPLE (from Armitage ref 4 p 122):
The data below represent a comparison of two media for culturing Mycobacterium
tuberculosis. Fifty suspect sputum specimens were plated up on both media
and the following results were obtained:
Medium B
Growth No Growth
Medium A: Growth 20 12
No Growth 2 16 N = 50
To analyse these data in Arcus you must select paired proportions from the
proportions sub-menu of the instant functions menu in the analysis section.
Select a 95% confidence interval by pressing enter when you are presented
with the confidence interval menu. Enter n as 50, ++(k) as 20, +-(r) as 12 and
-+(s) as 2.
For this example:
Proportion 1 = 0.64 (k+r)/n
Proportion 2 = 0.44 (k+s)/n
Proportion difference = 0.2 (r-s)/n
Cumulative probability (2-sided) = 0.012939 *
(1-sided) = 0.00647 **
Exact 95% Confidence Limits for the proportion difference:
Lower Limit = 0.040251
Upper Limit = 0.270014
Here we can conclude that the proportion difference is statistically
significantly different from zero. With 95% confidence we can say that the
true population value for the proportion difference lies somewhere between
0.04 and 0.27. This leaves us with little doubt that medium A is more
effective than medium B for the culture of tubercle bacilli.
Compare these results with the exact test for ¬matched pairs╪233702 ¬. Some find it
easier to discuss this type of result in terms of estimated relative risk.
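The two sided probability can be checked outside Arcus with a short Python
sketch (assuming the SciPy library); the exact confidence limits for the
difference use other methods and are not reproduced here:
  # A sketch of the exact test on the discordant pairs: under the null
  # hypothesis the r = 12 and s = 2 discordant pairs split 50:50.
  from scipy.stats import binomtest

  r, s = 12, 2
  print(binomtest(r, n=r + s, p=0.5).pvalue)    # approx. 0.0129, two sided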
¬<p values>╪29175 ¬
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|Miscellaneous| Functions
¬<Relative risk>╪269584 ¬
¬<Diagnostic test 2 by 2 table>╪272252 ¬
¬<Likelihood ratios for 2 by k tables>╪276470 ¬
¬<Number needed to treat>╪279627 ¬
¬<False result probabilities>╪282414 ¬
¬<Standardized mortality ratios>╪285320 ¬
|Relative Risk| in Incidence Studies
In studies of the incidence of a particular outcome in two groups of
individuals, defined by the presence or absence of a particular characteristic,
the appropriate measure of association for the resultant fourfold table is the
relative risk rather than the odds ratio.
Relative risk is used for prospective studies where you follow groups with
different characteristics to observe whether or not a particular outcome
occurs:
Group 1 Group 2
OUTCOME YES A B
NO C D
Relative Risk = [A/(A+C)]/[B/(B+D)]
In retrospective studies, where you select subjects by outcome and not by group
characteristic, you would use the odds ratio ((A/C)/(B/D)) and not the
relative risk. The odds ratio is often appropriate to case-control studies.
Arcus gives confidence intervals for the odds ratio in the 2 by 2 chi-square
test and in the exact confidence interval for 2 by 2 odds which is listed in
the exact tests menu.
This function gives you the relative risk with a confidence interval. The
iterative methods of approximation recommended by Gart and Nam are used in this
function (ref 35). Please note that relative risk, risk ratio and likelihood
ratio are the same calculation.
EXAMPLE (from Altman ref 5 p 267)
The following data represent a prospective investigation of Apgar score in
babies who had been classified as showing either symmetric or asymmetric growth
retardation on the basis of ultrasound investigation.
Symmetric IUGR Asymmetric IUGR
Apgar < 7 2 33
Apgar >=7 14 58
To analyse these data in Arcus you must select relative risk from the
miscellaneous sub-menu of the instant functions menu in the analysis section.
Select a 95% confidence interval by pressing enter when you are presented
with the confidence interval menu. Then enter the above frequencies into the
2 by 2 table on the screen.
For this example:
Risk ratio (relative risk in incidence study) = 0.344697
The 95% CI = 0.094377 to 1.040814
The 90% CI = 0.114327 to 0.902673
N.B. This is more accurate than the logit confidence interval quoted in ref 5.
Here we can say that the risk of a low Apgar score for symmetrically growth
retarded babies is about 35% of that risk for their asymmetrically growth
retarded counterparts. There are, however, rather few observations in the
symmetrical group, which is reflected by the broad 95% confidence interval.
An appropriate response to these "suggestive" results would be to go back and
collect more data.
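If you wish to check the point estimate outside Arcus then the following Python
sketch gives the risk ratio with the simpler log method interval; because Arcus
uses the Gart and Nam iteration (ref 35), its limits above differ slightly:
  # A sketch of the risk ratio with the simpler log method interval;
  # not the Gart and Nam method that Arcus itself uses.
  from math import exp, log, sqrt

  a, b, c, d = 2, 33, 14, 58                    # the fourfold table above
  rr = (a / (a + c)) / (b / (b + d))
  se = sqrt(1 / a - 1 / (a + c) + 1 / b - 1 / (b + d))
  print(rr)                                     # approx. 0.3447
  print(exp(log(rr) - 1.96 * se), exp(log(rr) + 1.96 * se))
                                                # approx. 0.09 to 1.30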
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|Diagnostic Test 2 by 2 table|
The quality of a diagnostic test is often expressed in terms of sensitivity and
specificity. Sensitivity is the ability of that test to pick up what you are
looking for and specificity is the ability of the test to reject what you are
not looking for.
DISEASE
Present Absent
TEST + a (true +ve) b (false +ve)
- c (false -ve) d (true -ve)
Sensitivity = a/(a+c)
Specificity = d/(b+d)
Likelihood ratio of a positive test = [a/(a+c)]/[b/(b+d)]
Likelihood ratio of a negative test = [c/(a+c)]/[d/(b+d)]
Likelihood ratios have become useful because they enable one to quantify the
effect a particular test result has on the probability of a certain diagnosis
or outcome. Using a simplified form of Bayes' theorem:
posterior odds = prior odds * likelihood ratio
where odds = probability/(1-probability)
probability = odds/(odds+1)
This Arcus function gives you the prevalence (pre-test likelihood), the
predictive values (post-test likelihoods) with their change, sensitivity,
specificity and
likelihood ratios (ref 12, 36). The confidence intervals for the likelihood
ratios are constructed using the iterative method suggested by Gart and Nam
(ref 35). This function is not truly Bayesian because it does not use any
starting probability. It does, however, provide a generator for likelihood
ratios which can then be used to direct the flow of probability in Bayesian
analysis. For an excellent account of this approach in medical diagnosis I
advise you to read David Sackett's book (ref 12).
EXAMPLE (from Sackett ref 12 p 109):
Initial creatine phosphokinase (CK) levels were related to the subsequent
diagnosis of acute myocardial infarction (MI) in a group of patients with
suspected MI. 80 international units of CK or greater was taken as an arbitrary
positive test result:
MI No MI
CK >= 80 215 16
CK < 80 15 114
To analyse these data in Arcus you must select diagnostic test 2 by 2 table
from the miscellaneous sub-menu of the instant functions menu in the analysis
section. Select a 95% confidence interval by pressing enter when you are
presented with the confidence interval menu. Then enter the above frequencies
into the 2 by 2 table on the screen.
For this example:
Disease / Feature:
present absent totals
Test: ╔══════════════════╤══════════════════╤══════════════════╗
Positive║ 215 │ 16 │ 231 ║
║ A│B │ ║
╟──────────────────┼──────────────────┼──────────────────╢
Negative║ 15 C│D 114 │ 129 ║
║ │ │ ║
╟──────────────────┼──────────────────┼──────────────────╢
Totals║ 230 │ 130 │ 360 ║
╚══════════════════╧══════════════════╧══════════════════╝
Prevalence (pre-test likelihood of disease) = 0.638889 = 64%
Predictive value of +ve test
(post-test likelihood of disease) = 0.930736 = 93% {change = 29%}
Predictive value of -ve test
(post-test likelihood of no disease) = 0.116279 = 12% {change = -52%}
Sensitivity (true positive rate) = 0.934783 = 93%
Specificity (true negative rate) = 0.876923 = 88%
Likelihood ratios with 95% confidence intervals:
LR (positive test) = 7.595109 (4.897431 to 12.12324)
LR (negative test) = 0.074371 (0.045345 to 0.120077)
Here we can say with 95% confidence that CK results of >=80 are at least 4.9
times more likely to come from patients who have had an MI than they are to
come from those who have not had an MI. Also with 95% confidence we can say
that CK results of <80 are at most only about one eighth (0.12) as likely to
come from patients who have had an MI as they are to come from those who have
not had an MI.
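The summary statistics in this output can be checked with a short Python sketch
(point estimates only; the Gart and Nam intervals are not reproduced here). It
also demonstrates the Bayes' theorem relation quoted above:
  # A sketch of the point estimates in the example output.
  a, b, c, d = 215, 16, 15, 114

  prevalence = (a + c) / (a + b + c + d)        # approx. 0.6389
  sens = a / (a + c)                            # approx. 0.9348
  spec = d / (b + d)                            # approx. 0.8769
  lr_pos = sens / (1 - spec)                    # approx. 7.5951
  lr_neg = (1 - sens) / spec                    # approx. 0.0744

  # Bayes: posterior odds = prior odds * likelihood ratio; the result
  # equals the predictive value of a positive test, a/(a+b).
  prior_odds = prevalence / (1 - prevalence)
  post_odds = prior_odds * lr_pos
  print(post_odds / (1 + post_odds))            # approx. 0.9307 = 215/231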
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|Likelihood ratios for 2 by k tables|
The quality of a diagnostic test is often expressed in terms of sensitivity and
specificity. Sensitivity is the ability of that test to pick up what you are
looking for and specificity is the ability of the test to reject what you are
not looking for.
DISEASE
Present Absent
TEST + a (true +ve) b (false +ve)
- c (false -ve) d (true -ve)
Sensitivity = a/(a+c)
Specificity = d/(b+d)
Likelihood ratio of a positive test = [a/(a+c)]/[b/(b+d)]
Likelihood ratio of a negative test = [c/(a+c)]/[d/(b+d)]
Likelihood ratios have become useful because they enable one to quantify the
effect a particular test result has on the probability of a certain diagnosis
or outcome. Using a simplified form of Bayes' theorem:
posterior odds = prior odds * likelihood ratio
where odds = probability/(1-probability)
probability = odds/(odds+1)
We can generalise these methods to situations of more than two test outcomes.
In this situation we have a two by k design where k is the number of test
outcomes studied. If one test outcome is called test level j then the
likelihood ratio at level j is given by:
likelihood ratio j = p(tj given disease)/p(tj given no disease)
where p(tj given ...) is the proportion displaying the relevant test result at
level j in the stated group.
This Arcus function gives you likelihood ratios and their confidence intervals
for each level of test result (ref 12, 36). The confidence intervals for the
likelihood ratios are constructed using the iterative method suggested by Gart
and Nam (ref 35).
EXAMPLE (from Sackett ref 12 p 111):
Initial creatine phosphokinase (CK) levels were related to the subsequent
diagnosis of acute myocardial infarction (MI) in a group of patients with
suspected MI. Four ranges of CK result were chosen for the study:
MI No MI
CK >= 280 97 1
CK = 80-279 118 15
CK = 40-79 13 26
CK = 1-39 2 88
To analyse these data in Arcus you must select likelihood ratios for 2 by k
tables from the miscellaneous sub-menu of the instant functions menu in the
analysis section. Select a 95% confidence interval by pressing enter when you
are presented with the confidence interval menu. Enter the number of test
levels as 4 then enter the above frequencies as prompted on the screen.
For this example:
RESULT + FEATURE - FEATURE Likelihood ratio with 95% CI
1 97 1 54.82609 (9.923024 to 311.5679)
2 118 15 4.446377 (2.772549 to 7.315978)
3 13 26 0.282609 (0.151798 to 0.524821)
4 2 88 0.012846 (0.003513 to 0.046229)
Here we can say with 95% confidence that CK results of >=280 are at least ten
(9.9) times more likely to come from patients who have had an MI than they are
to come from those who have not had an MI.
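The level specific ratios can be checked with a short Python sketch (point
estimates only; the confidence intervals require the Gart and Nam iteration and
are not reproduced here):
  # A sketch of the level specific likelihood ratios for the CK example.
  mi = [97, 118, 13, 2]                         # counts with MI
  no_mi = [1, 15, 26, 88]                       # counts without MI
  n_mi, n_no = sum(mi), sum(no_mi)              # 230 and 130

  for level, (x, y) in enumerate(zip(mi, no_mi), start=1):
      lr = (x / n_mi) / (y / n_no)              # p(level given MI) over
      print(level, round(lr, 6))                # p(level given no MI)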
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|Number needed to treat|
The object of treating patients is to prevent adverse outcomes. If we look at
one treatment or intervention in isolation then we can study its effect on the
outcome or the adverse effect in question. Laupacis et al. quote the large
Veterans Administration Trial where anti-hypertensives were investigated over
three years for their effect on target organ damage rates (ref 37). Let us
look at the definitions of some outcome statistics:
Treated Placebo
ADVERSE EVENT YES A B
NO C D
LET: Pc = proportion of subjects in control group who suffer an event
Pt = proportion of subjects in treated group who suffer an event
Pc = B / (B + D)
Pt = A / (A + C)
THEN: Relative risk reduction = (Pc - Pt) / Pc = RRR
      Absolute risk reduction = Pc - Pt = ARR = RRR * Pc
Number needed to treat = 1 / (Pc - Pt) = 1 / ARR
Arcus gives you relative risk, relative risk reduction, absolute risk reduction
and the number needed to treat. Confidence intervals for each of these
statistics are calculated using the iterative approaches advocated by Gart and
Nam (ref 35, 38).
EXAMPLE (from Haynes & Sackett ref 38):
In a trial of a drug for the treatment of severe congestive heart failure 607
patients were treated with a new angiotensin converting enzyme inhibitor (ACEi)
and 607 other patients were treated with a standard non-ACEi régime. 123 out
of 607 patients on the non-ACEi régime died within six months and 94 out of the
607 ACEi treated patients died within six months.
To analyse these data in Arcus you must select number needed to treat from the
miscellaneous sub-menu of the instant functions menu in the analysis section.
Select a 95% confidence interval by pressing enter when you are presented with
the confidence interval menu. Enter the number of controls as 607 with 123
suffering an event and enter the number treated as 607 with 94 suffering an
event.
For this example:
Proportion of controls suffering an event = 0.202636
Proportion of treated suffering an event = 0.15486
With 95% CI's:
Relative risk = 0.764228 (0.598901 to 0.974216)
Relative risk reduction = 0.235772 (0.025784 to 0.401099)
Absolute risk reduction = 0.047776 (0.005225 to 0.081277)
Number needed to treat = 21 (12 to 191)
Here we can say, with 95% confidence, that you need to treat as many as 191
or as few as 12 patients in severe congestive heart failure with this ACEi in
order to prevent one death that would not have been prevented with the standard
non-ACEi therapy in six months of treatment.
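The point estimates in this output can be checked with a short Python sketch;
the confidence intervals require the Gart and Nam methods (ref 35) and are not
reproduced by this simple illustration:
  # A sketch of the point estimates in the example output.
  nc, ec = 607, 123                             # controls and their events
  nt, et = 607, 94                              # treated and their events

  pc, pt = ec / nc, et / nt                     # approx. 0.2026 and 0.1549
  rr = pt / pc                                  # relative risk, approx. 0.7642
  rrr = (pc - pt) / pc                          # relative risk reduction
  arr = pc - pt                                 # absolute risk reduction
  print(rr, rrr, arr, 1 / arr)                  # NNT approx. 21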
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|False result probabilities|
When considering a diagnostic test for screening populations it is important
to consider the number of false negative and false positive results you will
have to deal with. The quality of a diagnostic test is often expressed in
terms of sensitivity and specificity. Sensitivity is the ability of that test
to pick up what you are looking for and specificity is the ability of the test
to reject what you are not looking for.
DISEASE
Present Absent
TEST + a (true +ve) b (false +ve)
- c (false -ve) d (true -ve)
Sensitivity = a/(a+c)
Specificity = d/(b+d)
We can apply Bayes' theorem if we know the approximate likelihood that a subject
has the disease before they come for screening; this is given by the prevalence
of the disease. For low prevalence diseases the false negative rate will be
low and the false positive rate will be high. For high prevalence diseases the
false negative rate will be high and the false positive rate will be lower.
People are often surprised by the high numbers of projected false positives; you
need a highly specific test to keep this number low. The false positive rate
of a screening test can be reduced by repeating the test. In some cases a test
is performed three times and the patient is declared positive if at least two
out of the three component tests were positive. This Arcus function simply
gives you the probability of false positive and false negative results for a
given prevalence of the disease being tested for (ref 8).
EXAMPLE (from Fleiss ref 8 p 9):
In a hypothetical example 2000 patients were tested with a screening test for
a disease. Of these 2000 patients 1000 were known to have the disease and 1000
were known to be free of the disease:
DISEASE
Present Absent
TEST + 950 (true +ve) 10 (false +ve)
- 50 (false -ve) 990 (true -ve)
To analyse these data in Arcus you must select false result probabilities from
the miscellaneous sub-menu of the instant functions menu in the analysis
section. Enter the true +ve rate as 0.95 (950/(950+50)) and the false +ve rate
as 0.01 (10/(990+10)). Enter the prevalence as 1 in 100 by entering n as 100.
For this example:
For prevalence of 100 per ten thousand of population tested:
Test SENSITIVITY = 95%
Probability of a FALSE POSITIVE result = 0.510309
Test SPECIFICITY = 99%
Probability of a FALSE NEGATIVE result = 0.00051
Here we see that more than half of the patients who test positive will not in
fact have the disease. This is clearly not acceptable for a
full screening method but could be used as pre-screening before further tests
if there was no better initial test available.
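The arithmetic behind this output is a direct application of Bayes' theorem and
can be checked with a short Python sketch (not the Arcus code itself):
  # A sketch of the Bayes calculation behind the example output.
  sens, spec = 0.95, 0.99                       # true +ve and true -ve rates
  prev = 1 / 100                                # prevalence, 1 in 100

  # probability of no disease given a positive test
  false_pos = ((1 - spec) * (1 - prev)) / (sens * prev + (1 - spec) * (1 - prev))
  # probability of disease given a negative test
  false_neg = ((1 - sens) * prev) / ((1 - sens) * prev + spec * (1 - prev))
  print(false_pos)                              # approx. 0.510309
  print(false_neg)                              # approx. 0.00051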
¬<reference list>╪310584 ¬
|Standardized Mortality Ratios|
This selection uses the indirect method to calculate standardized mortality
ratios. You must supply the mortality rates from a reference population, often
census data, and the size of each group of your study population. For each
(age) group you enter the size of that group in your study population and the
age/group specific mortality from the general population. You are then asked
about the units in which your mortality data were entered; for example, if you
entered deaths per 10,000 you should enter 10,000 and if you entered decimal
fractions you should enter 1. The SMR is expressed both as a ratio and as an
integer (100 times the ratio) along with its approximate confidence limits. A
test based on the null hypothesis that the numbers of observed and expected
deaths are equal is also given. This test uses a Poisson distribution
(ref 2, 4, 11).
EXAMPLE (from Bland ref 2 p 301):
The following data represent the age-specific mortality rates for liver
cirrhosis in men and the number of male doctors in each age stratum:
Age group Mortality per million men per year Number of male doctors
15-24 5.859 1080
25-34 13.050 12860
35-44 46.937 11510
45-54 161.503 10330
55-64 271.358 7790
To analyse these data in Arcus you must select standardized mortality ratios
from the miscellaneous sub-menu of the instant functions menu in the analysis
section. Enter the number of groups as 5 then enter mortality and group size
for each age group. Note that group size refers to the study group of doctors
and not the male population as a whole who were used to derive the mortality
data. Enter the mortality denominator as 1000000. Then after the expectation
table enter the observed deaths as 14. Select a 95% confidence interval by
pressing enter when you are presented with the confidence interval menu.
For this example:
Group(age)-specific Observed Population Expected Deaths
mortality
0.000005859 1080 0.006328
0.00001305 12860 0.167823
0.000046937 11510 0.540245
0.000161503 10330 1.668326
0.000271358 7790 2.113879
Total = 4.496601
Standardized Mortality Ratio = 3.113463
(sometimes quoted as 100 x integer = 311)
95% confidence interval = 1.482561 to 4.744365 (148 to 474)
Probability of observing 14 or more deaths by chance P = 0.0002 ***
Probability of observing 14 or fewer deaths by chance P = 0.9999
Here we can see that the total expected deaths from liver cirrhosis in male
doctors is 4.5 per year. The observed number, 14, was statistically highly
significantly greater than expected. With 95% confidence we can state that
male doctors in this country exhibit between 1.5 and 4.7 times the number of
deaths from liver cirrhosis expected from the general male population of
a similar age distribution. If the reason for this SMR is not obvious to you
then please attend a "ward night out" - hic!
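If you wish to check these figures outside Arcus then the following Python
sketch (assuming the SciPy library) reproduces the example. Note that the
confidence limit formula shown, SMR*(1 +/- 1.96/sqrt(observed)), is an
assumption about the approximation used; it happens to match the output above:
  # A sketch reproducing the SMR example; not the Arcus code itself.
  from math import sqrt
  from scipy.stats import poisson

  rates = [5.859e-6, 13.050e-6, 46.937e-6, 161.503e-6, 271.358e-6]
  sizes = [1080, 12860, 11510, 10330, 7790]
  observed = 14

  expected = sum(r * n for r, n in zip(rates, sizes))  # approx. 4.4966
  smr = observed / expected                            # approx. 3.1135
  half = 1.96 * smr / sqrt(observed)
  print(smr, smr - half, smr + half)                   # approx. 1.48 to 4.74
  print(poisson.sf(observed - 1, expected))            # P(14 or more), approx. 0.0002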
¬<p values>╪29175 ¬
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
The Arcus |Algebraic Calculator|
This function is available throughout Arcus. It is called up by pressing the
key combination [Alt]+[C]. You can use it to evaluate complex expressions or
to perform simple arithmetic. A seventy character algebraic expression
evaluator is provided. All calculations are done in double precision. If you
wish to evaluate an expression which consists of more than seventy characters
then you can use the Arcus worksheet; the result, however, will be in single
precision only.
The functions available are listed in the help screens which are invoked by the
usual F1 key press. These are the functions which are available in the Arcus
Worksheet, plus LR, which represents the last result provided by this calculator.
You can use LR in an expression even when the last result was not calculated
in the present calculator session.
Supported functions are:
Constants: PI
EE as e
ABS absolute value
CLOG common (base 10) logarithm
CEXP anti log (base 10)
EXP anti log (base e)
LOG natural (base e, Naperian) logarithm
SQR square root
! factorial (max 170)
LN! log factorial
IZ normal deviate for a p value
UZ upper tail p for a normal deviate
LZ lower tail p for a normal deviate
^ exponentiation (to the power of)
+ addition
- subtraction
* multiplication
/ division
\ integer division
ARCCOS arc cosine
ARCCOSH arc hyperbolic cosine
ARCCOT arc cotangent
ARCCOTH arc hyperbolic cotangent
ARCCSC arc cosecant
ARCCSCH arc hyperbolic cosecant
ARCTANH arc hyperbolic tangent
ARCSEC arc secant
ARCSECH arc hyperbolic secant
ARCSIN arc sine
ARCSINH arc hyperbolic sine
ATN arc tangent
COS cosine
COT cotangent
COTH hyperbolic cotangent
CSC cosecant
CSCH hyperbolic cosecant
SINH hyperbolic sine
SECH hyperbolic secant
SEC secant
TAN tangent
TANH hyperbolic tangent
AND logical AND
NOT logical NOT
OR logical OR
< less than
= equal to
> greater than
Please note that the largest factorial allowed is 170! but you can work with log
factorials via the LN! function, e.g. LN!(171).
Calculations give an order of priority to arithmetic operators; this must be
considered when entering expressions. For example, the result of the expression
"6 - 3/2" is 4.5 and not 1.5 because division takes priority over subtraction.
The following list gives the priority of arithmetic operators in descending
order:
1. Exponentiation (^)
2. Negation (-X)
(Exception = x^-y; i.e. 4^-2 is 0.0625 and not -16)
3. Multiplication and Division (*, /)
4. Integer Division (\)
5. Addition and Subtraction (+, -)
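For comparison, these rules can be illustrated outside Arcus in Python, which
applies the same ordering but writes exponentiation as ** rather than ^:
  print(6 - 3 / 2)    # 4.5, division before subtraction
  print(4 ** -2)      # 0.0625, the x^-y exception
  print(-4 ** 2)      # -16, exponentiation before negation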
As you work through a session with the Arcus calculator you can save individual
expressions and their results to a notepad by pressing S or F2. The notepad is
activated when you finish the present calculator session; at this point it will
present you with a list of all the results and expressions which you have saved
using the S or F2 key during the preceding session. The notepad can be edited
and the results sent to a printer or to the current log file.
An expression and result stack is available in this calculator. You save
results and their expressions to the stack when you press S or F2, i.e. the
same process as saving results to the notepad. You can access information
from the stack for subsequent calculations using the up and down cursor keys.
These cursor keys enable you to search up and down the stack for old results
or expressions to edit.
|APPENDICES|
¬<Glossary>╪292963 ¬
¬<Error Codes>╪293909 ¬
¬<ASCII codes>╪294887 ¬
Appendix One (|Glossary|)
df = degrees of freedom
^ = to the power of
^Key = Ctrl + another Key
/ = divided by
* = multiplied by
Z = standardized normal deviate
r = Pearson's product moment correlation coefficient
p = probability, see ¬<p values>╪29175 ¬
α = significance level
x = individual value of a vector/group/sample
n = vector/group/sample size
µ = mean (e.g. arithmetic mean, µ = Σx/n)
VAR = variance (e.g. sample variance, s² = (Σx²-(Σx)²/n)/(n-1))
SD = standard deviation (e.g. of mean, s = SQR(VAR))
SE = standard error (e.g. of mean, SEM = SD/SQR(n))
MS = mean square
CI = confidence interval, see ¬<confidence intervals>╪31897 ¬
ln(x) = natural (Naperian, base e) logarithm of x
vs = versus
DOS = disk operating system
ROM = read only memory
PC = personal computer
Program = programme
Disk = disc
Appendix Two (|Error Codes|)
The error trap within Arcus Pro-Stat provides messages which explain most of
the common error states but error numbers alone are sometimes given:
5 Illegal function requested
6 Overflow/Underflow (numbers greater than 3.4E+38 or smaller than 1.7E-38 in magnitude)
7 Out of memory
9 Array or memory error
11 Division by zero
14 Out of memory for some text and internal program data
16 Formula too complex
24 Waited too long for printer (beep)
25 Printer fault
27 Out of paper
51 Internal computer error
53 Requested disk file not found
54 Bad file mode
55 Attempt to open an already open file (Internal)
57 Disk drive fault
61 Disk full
64 Bad file name
67 Too many files on disk/directory
68 Requested disk does not exist
70 Disk/File access denied
71 Disk drive not closed
72 Disk fault
76 Path not found
Appendix Three (|ASCII codes|)
These are the decimal codes which can be used in the Arcus database CHR function
and which are returned by the Arcus database ASC function. Please remember that
all of these characters are accessible through an extended keyboard by holding
down the Alt key and tapping out the relevant code on the right hand numeric key
pad. The table below lists the characters for codes 33 to 254. Values below
this do have character representations but they double as control characters,
e.g. 9 is a tab. It is best to avoid these control characters if you can. The
extended character set is represented by values above 126. Please note that
extended characters may appear different on different computers, most notably
those running foreign language settings of DOS.
30 40 50 60 70 80 90 100 110 120 130 140 150 160
0 ( 2 < F P Z d n x é î û á
1 ) 3 = G Q [ e o y â ì ù í
2 * 4 > H R \ f p z ä Ä ÿ ó
3 ! + 5 ? I S ] g q { à Å Ö ú
4 " , 6 @ J T ^ h r å É Ü ñ
5 # - 7 A K U _ i s } ç æ ¢ Ñ
6 $ . 8 B L V ` j t ~ ê Æ £ ª
7 % / 9 C M W a k u ë ô ¥ º
8 & 0 : D N X b l v Ç è ö ₧ ¿
9 ' 1 ; E O Y c m w ü ï ò ƒ ⌐
170 180 190 200 210 220 230 240 250
0 ┤ ╛ ╚ ╥ ▄ µ ≡ ·
1 ½ ╡ ┐ ╔ ╙ ▌ τ ± √
2 ¼ ╢ └ ╩ ╘ ▐ Φ ≥ ⁿ
3 ¡ ╖ ┴ ╦ ╒ ▀ Θ ≤ ²
4 « ╕ ┬ ╠ ╓ α Ω ⌠ ■
5 » ╣ ├ ═ ╫ ß δ ⌡
6 ░ ║ ─ ╬ ╪ Γ ∞ ÷
7 ▒ ╗ ┼ ╧ ┘ π φ ≈
8 ▓ ╝ ╞ ╨ ┌ Σ ε °
9 │ ╜ ╟ ╤ █ σ ∩ ∙
Code 170 and 124 characters are not shown above because they are special
characters used by this hypertext system. 170 is the angle bar on most
keyboards and 124 is the vertical dashed line on most keyboards. In this
hypertext system 124 is used either side of a section title and 170 is used
either side of a link item. These characters can not be used in the body of
the hypertext.
|HELP|
This hypertext system provides an electronic user guide for Arcus Pro-Stat.
You navigate its pages using the following key strokes:
[Up] Move up one line
[Down] Move down one line
[Page Up] Move up one page
[Page Dn] Move down one page
[Tab] Move to the next link item
[Shift]+[Tab] Move to the previous link item
[Enter] Select the highlighted link item
[Home] Move to top of current section
[End] Move to bottom of current section
[I] Search the title index
[S] Search the entire help text for a word or phrase
[B] Move back a page
[P], [E] Edit and/or send current section to log file or printer
[Q], [Esc] Quit Arcus Hypertext
The left mouse button selects the link item or the bottom menu bar item which
is at the mouse cursor location when you press it. The right button quits this
hypertext help system.
Please note that all of the information in Arcus hypertext help is contained
in printed form in the Arcus reference manual.
For more information please see ¬<Hypertext>╪298521 ¬.
|Hypertext|
Arcus Pro-Stat has its own hypertext engine. This provides on-line help within
all Arcus software and gives you the opportunity to customise Arcus to your own
needs.
All of the help text is contained in a file called HELP.HTT. This is arranged
into chapters which are referred to as sections. Each section has a title and
all of the section titles are listed in the index. A section may contain links
to other related sections. Each link is called a link item. Link items are
shown as highlighted text and are often enclosed in angle brackets, e.g. <Link Item>.
In order to move to the section denoted by a link item you must first make sure
that the link item is active. On color monitors, active link items are
displayed in bright green and inactive link items are dull cyan. To make a link
item active just move through the different link items by pressing the tab key.
When you have made your chosen link item active you can select it by pressing
the enter key. Alternatively, click on any link item with the left hand mouse
button. If you want to move back to the page you were reading
before you selected the link item then press [B]. The number of back pages
available is displayed by the [B] button at the bottom left of the screen. If
you can not find what you are looking for in the index then you can search the
entire help text by pressing [S]. This searches for any word or phrase that
you specify.
The following keys are active in Arcus Hypertext:
[Up] Move up one line
[Down] Move down one line
[Page Up] Move up one page
[Page Dn] Move down one page
[Tab] Move to the next link item
[Shift]+[Tab] Move to the previous link item
[Enter] Select the highlighted link item
[Home] Move to top of current section
[End] Move to bottom of current section
[I] Search the title index
[S] Search the entire help text for a word or phrase
[B] Move back a page
[P], [E] Edit and/or send current section to log file or printer
[Q], [Esc] Quit Arcus Hypertext
The left mouse button selects the link item or the bottom menu bar item which
is at the mouse cursor location when you press it. The right button quits this
hypertext help system.
¬<Hypertext Help System Maintenance>╪300898 ¬
|Hypertext Help System Maintenance|
You can modify and/or expand Arcus Hypertext. The HELP.HTT file, which contains
all of the hypertext, is a plain ASCII text file. It can be changed using any
text processor. This is, however, a very large file which demands a capable
text processor; EDIT in DOS often cannot cope with it. The easiest way to
maintain HELP.HTT is to select "hypertext help system maintenance" from the
information menu. This enables you to work through Arcus hypertext, edit
specified sections and create new ones. Your old hypertext file is saved as
HELP.BAK.
If you are planning to do a lot of hypertext maintenance in Arcus then please
aim to use a fast computer with an efficient hard disk drive. The re-indexing
procedure is time consuming on a 286 with an un-cached hard disk. A well
configured 486 with a reasonably efficient hard drive will rapidly re-index
Arcus Hypertext. Disk cache software such as SMARTDRV in MS-DOS 6 gives a
large improvement in hard disk operation.
There are only two special characters which you must remember when editing
Arcus hypertext: these are the vertical dashed line and the angle bar. The
vertical dashed line is usually at the bottom left of your keyboard to the
left of Z and is usually the shifted version of the back slash \. The vertical
dashed line has the ASCII code 124. The angle bar is near the top left hand
corner of most keyboards and is usually the shifted version of the single
opening quote `. The angle bar has the ASCII code 170. Neither of these
characters can be displayed here so let the vertical dashed line = {124} and
let the angle bar = {170}. You should also avoid the use of ASCII character
216 (╪).
To mark text as a title you must include two {124} on that line. There must
be no other text on the title line. To mark text as a link item you must
enclose it in two {170}'s. Only the first twenty characters of a title or a
link item are used for indexing and linking. Try to use link items which match
section titles exactly; this enables Arcus to do all indexing for you
automatically.
Sample of hypertext:
{124}Section 1{124}
This is an example of body text in Arcus Hypertext.
For more information please see {170}body text{170}.
{124}Body Text{124}
This is the section on body text which links to the link item in section 1.
Thus, the only restrictions on hypertext are the use of ASCII characters 124,
170, 216 and control characters such as tabs (ASCII 9). You can use any other
ASCII characters; for example, you can compose diagrams using the line drawing
characters apart from 216 (see ¬ASCII codes╪294887 ¬).
There are no practical limits on the size of the Arcus hypertext file. If you
have a vast number of sections and a large worksheet open then you might run
into memory problems on a computer with little free memory. Otherwise you
should be able to run your own customised versions of Arcus Hypertext without
any problems.
If you teach statistical methods then please see ¬educational uses╪304017 ¬.
|Educational Uses| of Arcus Pro-Stat
Arcus Pro-Stat has been written for use by people of all levels of statistical
expertise. Some Arcus users have written their own versions of the ¬hypertext╪298521 ¬
help system to give additional explanations and exercises to their students.
Arcus is also used by many experienced statisticians. There is therefore the
potential for someone to learn statistical methods with Arcus and then go on
to practise those methods with the same package. This avoids a second learning
curve.
|Finish|
This closes the current Arcus session. If you have forgotten to save any new
or altered worksheet data then you will be prompted to do so before leaving
Arcus.
|Information|
This section provides pages of text on using Arcus in your approach to good
statistical design, analysis and presentation. There is also an interactive
statistical method selection session which covers the more simple analyses.
|Function Overview|
Here is a brief summary of the functions within the analysis section of Arcus:
¬DESCRIPTIVE STATISTICS╪80612 ¬
~~~~~~~~~~~~~~~~~~~~~~
Number, arithmetic mean, variance, standard deviation, standard error of the
mean, user defined confidence interval for the mean, geometric mean, skewness,
kurtosis, maximum, upper quartile, median, lower quartile, minimum, user
defined quantile.
¬ARITHMETICAL MANIPULATION╪78201 ¬
~~~~~~~~~~~~~~~~~~~~~~~~~
Manipulate one or several worksheet columns using your own formulae.
Transformations for proportions.
¬PICTORIAL STATISTICS╪81471 ¬
~~~~~~~~~~~~~~~~~~~~
Histogram, box and whisker, scatter, normal, survival, error bar, spread and
ladder.
¬PARAMETRIC╪87475 ¬
~~~~~~~~~~
Single sample Student t, paired Student t, unpaired Student t, F (variance
ratio), Z (normal distribution) and Shapiro-Wilk W test for non-normality.
¬NONPARAMETRIC╪98877 ¬
~~~~~~~~~~~~~
Mann-Whitney U, Wilcoxon signed ranks, Spearman's rank correlation, Kendall's
rank correlation, Cuzick's test for trend, confidence intervals for quantiles,
Kolmogorov Smirnov two sample test, Ranking and normal scores.
¬REGRESSION AND CORRELATION╪119789 ¬
~~~~~~~~~~~~~~~~~~~~~~~~~~
Simple linear, general/multiple linear, regression in groups (linearity,
differences between regression lines and covariances), polynomial (with area
under curve and back interpolation), linearized estimates (exponential,
geometric and hyperbolic) and probit analysis (also for logistic curves).
¬ANALYSIS OF VARIANCE╪158578 ¬
~~~~~~~~~~~~~~~~~~~~
One way, two way, two way with replicates/repeated measures, crossover,
Kruskal Wallis and Friedman.
¬SURVIVAL ANALYSIS╪182274 ¬
~~~~~~~~~~~~~~~~~
Kaplan-Meier product limit estimates of survival and the cumulative hazard
function (including plots), simple Berkson-Gage life tables, log-rank and
Wilcoxon tests and Wei Lachin.
¬DISTRIBUTIONS╪213522 ¬
~~~~~~~~~~~~~
Normal, chi-square, Student t, Snedecor's F, Studentized Q, binomial, Poisson,
Spearman's rho and Kendall's tau.
¬CHI-SQUARE╪218665 ¬
~~~~~~~~~~
Two by two, two by k with trend, r by c with trend, McNemar's, Mantel Haenszel
and Woolf.
¬EXACT╪243294 ¬
~~~~~
Fisher's, exact (Gart) confidence intervals for two by two odds, Liddell's and
the sign test.
¬RANDOMISATION╪252007 ¬
~~~~~~~~~~~~~
Integer series, case-control pairs and case / control groups.
¬SAMPLE SIZE╪256010 ¬
~~~~~~~~~~~
For Student t tests, comparison of proportions and population surveys.
¬PROPORTIONS╪262904 ¬
~~~~~~~~~~~
Single, unpaired and paired.
¬MISCELLANEOUS╪269298 ¬
~~~~~~~~~~~~~
Bayesian (test likelihoods, false result probabilities), relative risk,
risk reductions with number needed to treat and standardized mortality ratios.
¬ALGEBRAIC CALCULATOR╪288806 ¬
~~~~~~~~~~~~~~~~~~~~
Full function algebraic expression evaluator available by pressing Alt+C from
any menu or result screen.
|Benefits of Registration|
Registered users of Arcus are kept informed of developments in the Arcus project
by newsletters. Upgrades are offered to registered users at low cost and all
registered users can request new functions for Arcus.
Part of each Arcus registration fee is donated to a registered charity and the
rest is fed back into further research and development of Arcus. This project is to
be supported indefinitely.
If you are not a registered Arcus user then you can order your copy of the
latest version of Arcus with a clip bound manual by pressing the enter key to
select the order form. When the order form is displayed, press E and fill in
your details. You can then print out the completed order form.
¬<Order Form>╪308826 ¬
|Order Form| & INVOICE FOR ARCUS PRO-STAT STATISTICAL ANALYSIS SYSTEM
Supplier: Medical Computing, Tel UK (0)695 424 034
83, Turnpike Road, FAX UK (0)51 256 7001
Aughton,
West Lancs,
L39 3LD.
United Kingdom
Supply to:
Post code:
What is your intended use for Arcus?
If this is a site licence who is the contact for Arcus newsletters?
I require (tick one) [ ] 3.5 inch 1.4MB high density diskette
[ ] 3.5 inch 720k diskettes
[ ] 5.25 inch 360k floppy disks
I understand that Arcus Pro-Stat version 3.0 or later requires at least a
286 processor to run [ ].
Licence fees: Quantity required: Total Price:
Single user £ 139 [ ] [ ]
Ten user £ 389 [ ] [ ]
Twenty user £ 590 [ ] [ ]
Fifty user £1200 [ ] [ ]
Large site £negotiable [ ] [ ]
Postage & Packing: £ 8 for UK [ ]
£15 for Non-UK
TOTAL [ ]
Please make all payments in pounds sterling.
Please make cheques payable to Dr Iain E. Buchan.
Official Government and University orders are accepted.
Convertible cheques in pounds sterling or US money orders are accepted.
If you have any questions then please telephone or FAX to the UK numbers
listed above.
|Reference List|
¬<Introductory Texts>╪310834 ¬───────────∙ref 1 - 3
¬<Core Reference Texts>╪311139 ¬─────────∙ref 4 - 7
¬<Other references>╪311556 ¬─────────────∙ref 8 - 31
¬<Algorithms>╪315734 ¬───────────────────∙ref A1 - A21
|Introductory Texts|
1. Petrie Aviva, Lecture Notes on Medical Statistics, Blackwell Scientific
Publications 1990.
2. Bland Martin, An Introduction to Medical Statistics, Oxford Medical
Publications 1989.
3. Colton Theodore, Statistics in Medicine, Little, Brown & Co. 1974.
|Core Reference Texts|
4. P. Armitage & G. Berry, Statistical Methods in Medical Research,
Blackwell 1987.
5. Altman Douglas G., Practical Statistics for Medical Research, Chapman
and Hall 1991.
6. Conover W. J., Practical Nonparametric Statistics, Wiley 1980.
7. Kendall M. G., Stuart A. and Ord J. K., The Advanced Theory of
Statistics, (4th edition), London: Griffin 1983.
|Other References|
8. Fleiss J., Statistical Methods for Rates and Proportions, Wiley 1981.
9. Fleiss J., J. Chron. Diseases, 32, pp. 69 - 77, 1979.
10. Schlesselman J., Case-Control Studies, Oxford University Press 1982.
11. Gardner Martin J., Altman Douglas G., Statistics with Confidence -
Confidence Intervals and Statistical Guidelines, British Medical Journal
1989.
12. Sackett David L. et al., Clinical Epidemiology - a basic science for
clinical medicine, Little, Brown & Co. 1985.
13. Wallenstein Sylvian, Some statistical methods useful in circulation
research, Circulation Research 47(1) 1980.
14. Wetherill G. Barrie, Intermediate Statistical Methods, Chapman Hall 1981.
15. Hollander Myles, Wolfe Douglas A., Nonparametric Statistical Methods,
Wiley 1973.
16. Basic Professional Development System (Compiler 7.1), Microsoft
Corporation 1990.
17. FORTRAN Optimising Compiler (version 5.1), Microsoft Corporation 1989.
18. Finney D. J., Probit Analysis, Cambridge University Press 1971.
19. Finney D. J., Statistical Method in Biological Assay, Charles Griffin &
Co. 1978.
20. Liddell F. D. K., Simplified exact analysis of case-referent studies;
matched pairs; dichotomous exposure., J. Epidemiol. Comm. Health, 37,
82-84, 1983.
21. Shapiro S. S. & Wilk M. B., An analysis of variance test for normality.,
Biometrika, 52(3), 591 ff., 1965.
22. Miller R. G. (jnr), Simultaneous Statistical Inference, (2nd edition)
Springer-Verlag 1981.
23. Draper N. R. and Smith H., Applied Regression Analysis, (2nd edition)
New York: Wiley 1981.
24. Lawless J. F., Statistical Models and Methods for Lifetime Data, New York:
Wiley 1982.
25. Kalbfleisch J. D. and Prentice R. L., Statistical Analysis of Failure
Time Data, New York: Wiley 1980.
26. Wei L. J. and Lachin J. M., Two Sample Asymptotically Distribution Free
Tests for Incomplete Multivariate Observations, J. Am. Statist. Ass.
79, 653-661, 1984.
27. Bailey N. T. J., Mathematics, Statistics and Systems for Health, New York:
Wiley 1977.
28. Cuzick Jack, A Wilcoxon-Type Test for Trend, Stat. Med. 4, 87-89, 1985.
29. Bland Martin & Altman Douglas, Statistical Methods for Assessing the
Difference Between Two Methods of Measurement, Lancet, 307-310, 1986.
30. Dupont W. D., Power and Sample size calculations, Controlled Clinical
Trials 11, 116-128, 1990.
31. Pearson & Hartley, Biometrika tables for statisticians, 3rd Ed.,
Cambridge University Press, 1970.
32. Belsley, Kuh, Welsch, Regression Diagnostics, Wiley 1980.
33. Press W. H. et al., Numerical Recipes, The Art of Scientific Computing,
2nd Ed., Cambridge University Press, 1992.
34. Ross J. G., NonLinear Estimation, Springer-Verlag New York 1990.
35. Gart J. J. & Nam J., Approximate interval estimation of the ratio of
binomial parameters: a review and corrections for skewness, Biometrics 44,
323-338, 1988.
36. Sackett David L. et al., Interpretation of diagnostic data (5), Canadian
Medical Association Journal, 129, 947-975, 1983.
37. Laupacis A., Sackett D. L., Roberts R. S., An assessment of clinically
useful measures of the consequences of treatment, New England J. Med.,
318(26), 1728-33, 1988.
38. Haynes Brian & Sackett David, Personal communications on diagnostic and
treatment outcome statistics, McMaster University, 1993.
39. Peto R., Pike M. C., Armitage P., Breslow N. E., Cox D. R., Howard S. V.,
Mantel N., McPherson K., Peto J., Smith P. G., Design and analysis of
randomised clinical trials requiring prolonged observation of each patient.
Part I: Introduction and design, Br. J. Cancer, 34, 585-612, 1976.
40. Peto R., Pike M. C., Armitage P., Breslow N. E., Cox D. R., Howard S. V.,
Mantel N., McPherson K., Peto J., Smith P. G., Design and analysis of
randomised clinical trials requiring prolonged observation of each patient.
Part II: Analysis and Examples, Br. J. Cancer, 35, 1-39, 1977.
Published |Algorithms|
A1 Pike M. C., Hill I. D., Algorithm 291, Logarithm of the Gamma Function,
Comm. Ass. Comput. Mach., 9, 684 1966.
A2 Macleod Allan J., AS 245, A Robust and Reliable Algorithm for the
Logarithm of the Gamma Function, Appl. Statist. 38(2) 1989.
A3 Hill I. D., AS 66, The Normal Integral, Appl. Statist. 22(3) 1973.
A4 Odeh R. E., Evans J. O., AS 70, Percentage Points of the Normal
Distribution, Appl. Statist. 23 1974.
A5 Best D. J., Roberts D. E., AS 91, The Percentage Points of the Chi²
Distribution, Appl. Statist. 24(3) 1975.
A6 Dinneen L. C., Blakesley B. C., AS 62, A Generator for the Sampling
Distribution of the Mann-Whitney U Statistic, Appl. Statist. 22(2) 1973.
A7 Majumder K. L., Bhattcharjee G. P., AS 63, The Incomplete Beta Integral,
Appl. Statist. 22(3) 1973.
A8 Majumder K. L., Bhattcharjee G. P., AS 64, Inverse of the Incomplete Beta
Function Ratio, Appl. Statist. 22(3) 1973.
A9 Cran G. W., Martin K. J., Thomas G. E., R19 and AS 109 further to AS 63
and AS 64, Appl. Statist. 26(1) 1977.
A10 Berry K. J., Mielke P. W., Cran G. W., R83 further to AS 64, Appl.
Statist. 39(2) 1990.
A11 Lund R. E., Lund J. R., AS 190, Probabilities and Upper Quantiles for the
Studentized Range, Appl. Statist. 34 1983.
A12 Royston J. P., R69 further to AS 190, Appl. Statist. 1987
A13 Best D. J., Roberts D. E., AS 89, Upper Tail Probabilities of Spearman's
Rho, Appl. Statist. 24(3) 1975.
A14 Best D. J., Gipps P. G., AS 71, Upper Tail Probabilities of Kendall's Tau,
Appl. Statist. 23(1) 1974.
A15 Thomas Donald G., AS 36, Exact Confidence Limits for the Odds Ratio in a
Two by Two Table, Appl. Statist. 20(1) 1971.
A16 Shea B. L., AS 239, Chi-square and incomplete gamma integral, Appl.
Statist. 37(3) 1988.
A17 Royston J. P., AS 181, The W Test for Normality, Appl. Statist. 31(2)
1982.
A18 Royston J. P., AS 177.3, Expected Normal Order Statistics (Approximate),
Appl. Statist. 31(2), 1982.
A19 Harding E. F., An Efficient Minimal Storage Procedure for Calculating the
Mann-Whitney U, Generalised U and Similar Distributions, Appl. Statist.
33 1983.
A20 Neumann N., Some Procedures for Calculating the Distributions of
Elementary Nonparametric Test Statistics, Statistical Software
Newsletter, 14(3) 1988.
A21 Makuch Robert et al., AS 262, A Two Sample Test for Incomplete
Multivariate Data, Appl. Statist. 40(1), 1991.